DOCUMENT RESUME

ED 403 863                                             HE 029 945

AUTHOR        Wangerin, Paul T.
TITLE         Lies; Damned Lies; Statistics; and Law School Grades.
              Grade Conferences from Hell: Measurement Error in Law
              School Grading.
PUB DATE      Jul 94
NOTE          106p.; Paper presented at Annual Conference on The
              Science and Art of Law Teaching (Spokane, WA, July
              15-16, 1994).
PUB TYPE      Speeches/Conference Papers (150) -- Viewpoints
              (Opinion/Position Papers, Essays, etc.) (120)
EDRS PRICE    MF01/PC05 Plus Postage.
DESCRIPTORS   Civil Rights; *Court Litigation; *Educational
              Malpractice; *Error of Measurement; *Evidence
              (Legal); Grade Prediction; Grades (Scholastic);
              *Grading; Higher Education; *Law Schools; Legal
              Problems; Psychometrics; Scoring; Student Rights;
              Teacher Made Tests; Testing Problems; Test
              Reliability; Test Validity



ABSTRACT 

This paper addresses problems confronting law school 
teachers in grading law school exams and assigning letter grades. 
Using prototypical dialogue and scenarios, the paper examines 
mathematical and statistical issues that contribute to grading 
errors. Discussed in relation to real world data and the bar exam 
are: differential weighting, combining scores, test reliability, 
consistency in measurement, and standard error issues. The paper also 
reviews two sets of court cases. In so-called "academic challenge"
cases, case law is clear -- the burden of proof is on test-takers, who
must show that tests violate accepted norms. In "high-stakes testing"
cases, on the other hand, the burden of proof is upon test-scorers, who
must prove that tests comply with accepted academic norms. Since such
cases often involve claims of civil rights, court rulings are more
ambiguous. This raises the question of whether most law school grades
are high-stakes tests or simply academic challenge situations.
Appended to the paper are sample test questions which instructors can
use to evaluate their own grading biases. Also appended is a
chapter, "Constructing and Using Essay and Product Development Tests,"
from the book "Measuring and Evaluating School Learning," by Lou M.
Carey. (Contains 50 references.) (CH)



***********************************************************************
*  Reproductions supplied by EDRS are the best that can be made      *
*  from the original document.                                       *
***********************************************************************





GONZAGA 

UNIVERSITY 



Institute For 

Law School Teaching

SCHOOL OF LAW • P.O. Box 3528, Spokane, Washington 99220-3528 • 509/328-4220 (Ext. 3740) 






The Science and Art of Law Teaching 

July 15-16, 1994 



Lies; Damned Lies; Statistics; and Law School Grades



by 






Paul T. Wangerin 



Paul Wangerin is Associate Professor of Law at The John Marshall Law School. He
received an A.B. from the University of Missouri in 1969 and a J.D. with high honors from The
John Marshall Law School. From 1978 to 1979 he clerked for the Illinois Supreme Court, and
from 1979 to 1982 he practiced law with Winston & Strawn in Chicago. Professor Wangerin
teaches courses in Remedies and Contracts and principally writes about Remedies and legal
education issues.




U.S. DEPARTMENT OF EDUCATION
Office of Educational Research and Improvement
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

[X] This document has been reproduced as received from the person or
    organization originating it.

[ ] Minor changes have been made to improve reproduction quality.

Points of view or opinions stated in this document do not necessarily
represent official OERI position or policy.



“PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY 

Paul T. Wangerin 



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC).” 






Grade Conferences From Hell:
Measurement Error in Law School Grading



Paul T. Wangerin 
John Marshall Law School 
Chicago, Illinois 



Abstract 

An enormous amount of anecdotal evidence, and at least some 
empirical evidence, suggests that many law school teachers make a 
number of serious "measurement errors" in connection with the 
grading of law school exams and the assignment of letter grades. 
Though some of these errors involve difficult "discretionary" 
issues, some involve essentially mathematical or statistical 
issues. The present paper discusses the second of those two kinds 
of error, the kind involving statistical or mathematical issues. 
The present paper also discusses two separate but parallel sets of 
judicial decisions. Both of these sets deal with judicial 
responses to grading disputes. One of those sets, a set dealing 
with what is sometimes called "high-stakes" testing, places the
burden on test-scorers to prove that the tests involved comply with
accepted academic norms. Conversely, the other of those sets, a
set dealing with the notion of "academic challenges" to classroom
grades, places the burden on test-takers to prove that the tests
involved violate accepted academic norms. Law school grading, the
analysis concludes, is a kind of "hybrid" situation, a situation
existing somewhere between academic challenges and high-stakes
testing. Thus, courts must deal with the kind of law school
measurement errors described herein in some sort of hybrid fashion. 






I. Introduction

Most law school teachers have vague feelings of uneasiness 
regarding grading practices in the law schools. Most such teachers 
understand, for example, the awesome consequences of minor letter 
grade differences between the grades that they give to different 
students. The GPA's of students at the very top of a graduating
class, after all, often differ by only tenths, or even hundredths,
of points. Such minuscule differences, however, often lead to
life-changing consequences for the students involved. Likewise,
minuscule grade differences can lead to life-changing consequences 
for students at the bottom of a class. The difference between a 
1.99 GPA and a 2.00 GPA, and thus the difference between staying in 
school and flunking out, is minuscule. 



Many law teachers also have vague feelings of uneasiness about
the very process of grading exams, and of assigning letter grades.
Many law teachers realize, for example, that the grades they assign
to essay type questions are inherently subjective. Further, many
law school teachers probably acknowledge that the standard single
exam system of law school grading, a system wherein the grade for 
a course reflects nothing other than performance on a single final, 
is, at best, educationally problematic. 1 Finally, many law 
teachers probably acknowledge that the different letter grades that 
they give to students whose test scores differ only slightly 
perhaps do not reflect much in the way of real differences in 
performance. 

All of these things, and other related ones, cause law school
teachers to dread student grade conferences perhaps more than any
other thing. "You received the most points that I could possibly
give on that question," law teachers must constantly say to
desperate students during such conferences. "Your paper was
significantly worse than the next higher one up." While teachers
say these things, however, they frequently are thinking just the
opposite. "Of course I could have given an extra point for that
answer," these teachers are thinking as they mouth opposite words.
"In fact, there probably was no real difference between those two
exams despite the fact that I gave them different grades."

Fortunately for most teachers, students aggrieved by grading 
decisions generally know very little about two things. First, 
these students generally know very little about the legal rights 
that they might have vis a vis their teachers and their schools. 
Second, these students generally know very little about the 
principles of educational testing and measurement.



The following analysis addresses both of those shortcomings in 
most students' knowledge. First, building on widely accepted ideas 
from the world of educational measurement, the analysis suggests 
that the grades that many law school teachers give probably are
affected by a number of very serious "measurement" errors. 2 Those
errors, the analysis suggests, involve (1) "weighting" issues, (2)
"reliability" issues, and (3) "standard error of measurement" 
issues. Second, the analysis describes the difference in burdens of 
proof between "academic challenge" lawsuits involving traditional 
classroom grades and lawsuits involving the results on "high 
stakes" tests. Students have virtually no chance of success if 
they file academic challenge suits. Conversely, a substantial 
possibility of success exists for students in high stakes testing 
suits. Law school grading practices, the analysis then suggests, 
are a hybrid between traditional grading and high stakes testing. 
Thus, courts of law should be allowed to intervene in connection 
with the kind of measurement errors described herein. 

Three introductory points must now quickly be made. First, in 
connection with the preparation of the present analysis, this 
writer informally collected 13 sets of grades from 11 teachers at 
a large urban law school. While collecting these sets of grades, 
this writer made no attempt to make certain that these sets of 
grades, these teachers, and this school, are representative 
generally of law school grades, teachers and schools. Thus, it is
possible that the real world grading problems described herein are
entirely isolated. Having said that, however, it must also be
noted that no reason exists to think that the data collected is not
representative of grades, teachers and schools generally. Thus, it 
is surely possible that the real world grading problems described 
herein are widespread. Second, although the following discussion 
regularly refers to ideas from the world of statistical analysis, 
readers who know nothing about that world have nothing to fear 
herein. Unlike some discussions of statistics, discussions which 
seem to revel in abstruseness, the present analysis strives for 
simplicity above all else. 3 

The third point yet to be made relates closely to the second. 
Although most law school teachers seem to have little knowledge of 
the kinds of measurement and statistical issues discussed herein, 
and thus make the kind of grading mistakes described, the ideas 
described herein are not totally foreign to the field of law. The 
Bar Examiner, for example, frequently publishes articles addressing
the kinds of measurement and statistical ideas discussed herein. 4 
Further, and perhaps more importantly, as the last part of this 
analysis reveals, courts that have dealt with "high stakes" tests 
regularly address these issues. 

II. Measurement Error in Law School Grading

Countless law students have argued in countless grade 
conferences that their teachers have "erred" by not giving them
additional points on particular essay questions. And countless law
students have argued in such conferences that their teachers have 
"erred" because the teachers gave too many C's or D's or the like. 
The present analysis completely ignores these kinds of alleged 
grading errors. It does so, in turn, for a straightforward
reason. Grading errors of the kind just described are errors of 
judgment or discretion. Thus, obviously, errors of this kind 
cannot easily be quantified or turned into hard and fast numbers. 
Further, careful analysis might well reveal that errors of this 
kind are not errors at all. Rather, these errors might turn out to 
be nothing more than differences of opinion. 

The present analysis concentrates on an entirely different
kind of measurement error. These errors involve purely
mathematical or statistical issues. Thus, discretion plays no role
whatsoever here. Thus, if as demonstrated herein a teacher makes
statistical errors when combining scores from different parts of
exams, that teacher cannot simply point to discretionary
decisions. If the numbers are wrong, then the numbers are wrong.
Period. Likewise, if as demonstrated herein a teacher fails to
employ a test that is sufficiently reliable, or fails to take into
account an appropriate amount of measurement error, that teacher
again cannot simply hide behind the notion of discretion. Again,
if the numbers are wrong, then the numbers are wrong.

Note carefully now an important point. Although the present 
analysis addresses nothing other than statistical or mathematical 
types of grading errors, the errors discussed herein are by no 
means simple or obvious ones. The present paper, for example, 
spends no time on situations in which teachers simply miscalculate 
addition sums. And the present paper spends no time on situations 
in which teachers overlook the second of two blue books, or 
inadvertently fail to read an answer to a particular question. 
Rather, the present analysis concentrates on kinds of statistical 
or mathematical errors that measurement experts fully understand 
but that law school teachers for the most part know nothing about. 

Three different kinds of measurement error are described 
below. The first of those, herein called "weighting" problems, 
occurs when teachers total up the scores from different parts of 
exams. The second of those, dealing with the "reliability" of 
tests, involves questions about the likelihood that a particular 
test will produce consistent results. The third of these errors 
deals with the notion of the "standard error of measurement." 
Standard error of measurement problems occur when teachers assign 
different letter grades to students whose point totals differ by 
relatively small amounts.

A. Differential "Weighting" 

It is probably safe to say that the most dramatically obvious
"measurement error" that law school teachers routinely make
involves the "weighting" of different parts of exams. 5 Teachers
make this mistake, in turn, because they assume that they can
simply add up the scores from different parts of exams. Teachers
who give a test that contains four "equally weighted" essay
questions, for example, usually assume that they can just add up
the scores from those different questions and assign grades in
light of the totals. Likewise, teachers who give exams that
combine essays with objective-type questions usually also assume
that they can just add up point scores. As the following will 
demonstrate, however, such simple addition can lead to very, very 
serious mistakes. 

1. Teacher 1

Assume that the following figure contains scores and grades
that Teacher 1 gave on a test that contained four "equally
weighted" essay questions. (For convenience only, students here
are identified by names rather than numbers.)

Figure 1

NAME      #1     #2     #3     #4     TTL    GRD
JOHNS      4      2      2      5     13     F
DIAZ       1      4      5      4     15     D
CHENG      4      7      2      4     17     C
PHILL      ?      ?      3      5      ?     C
PLURS      ?      ?      ?      4     20     C+
HOPPE      4      ?      5      5     22     B
SMITH      5      6      6      6     23     B
ADAMS      4      7      7      5     23     B
JONES      4     10      9      5     28     A
WENDI      3     10     10      5     28     A

MEAN     3.7    6.4    5.8    4.8   20.7

(? = entry illegible in the source copy)


Note carefully some of the details regarding these grades.
First, Teacher 1, like most teachers, has calculated the "mean" (or
average) scores so that she can tell roughly where students stand
in relation to each other. Further, Teacher 1, like most
teachers, has ranked her students by overall point totals, and
assigned letter grades in light of those point totals.

Imagine now that students Smith and Adams appear for a joint
grade conference with Teacher 1. Teacher 1 tells these two
students that they both received the same number of points (23), a
number that put them somewhat above the middle of the class. These
scores, the teacher then tells these students, earned them grades
that were just above average, namely, B's. But the students
press on, producing the first of a group of "grade conferences from
hell."

Smith: How did I do on the individual questions?

Teacher 1: Well, Smith, on questions 1 and 4 you did pretty
good. In fact, very good, at least in relation to
other students. On those two questions you got the
highest scores that I gave, namely, 5 and 6. On
questions 2 and 3, however, not so good. You got
only the average scores, namely 6.

Smith: So, I got the highest score given on half of the
test, and the average score on the other half?

Teacher 1: Yes, that's correct.

Smith: And I still got only the average grade for the
course?

Teacher 1: Yes, that's correct.

Adams then cuts in.

Adams: How did I do on the individual questions?

Teacher 1: Well, on all of the questions you got about the
average scores, namely, 4, 7, 7 and 5.

Adams: So, I got the average score for the whole test?

Teacher 1: That's correct.

Adams: And I got a B, the average grade?

Smith butts in.

Smith: Wait a minute. Adams got average scores for all of
the questions and he got a B. I got average scores
on two questions but I got the top score on two
questions. And I also got a B. That's crazy.

Teacher 1: Well, you two got the same number of points, so I
had to give you the same grade.

Smith: But you admit yourself that I wrote a much, much
better exam than Adams.

Teacher 1: I'm sorry. There's nothing I can do. Numbers
don't lie.


Smith and Adams then leave, and Teacher 1 begins to take what
she hopes is a long break from students. But in walk Johns and
Diaz, also study partners. And the nightmare continues.

Teacher 1: Well, Johns, you received the lowest point total
that I gave, namely 13 points. Hence, you got the
lowest grade, F. Ms. Diaz, you got 15 points.
Since this was not quite as bad as Johns, you got a
D.

Johns: Tell me, if you will, about my individual scores.

Teacher 1: Well, Johns, your score on question 1 was 4 and on
question 4 was 5. Those scores were just slightly
above the average. But your score for question 2
was 2 and for question 3 was 2. Those scores were
well below average.

Johns: So on half the test I scored slightly above average
and on half I scored below average.

Teacher 1: Yes, that's correct.

Johns: And yet I got the lowest grade in the class?

Teacher 1: Yes, that's correct.

Johns: Let me make sure I understand this. On half the
test I scored above average and I still got the
lowest grade you gave?

Diaz cuts in before the teacher can answer.

Diaz: What were my scores?

Teacher 1: Well, Diaz, your scores were 1, 4, 5 and 4,
respectively. All of those scores were below
average, some way below average.

Diaz: I guess that's why I got a D.

Teacher 1: Sure.

Johns suddenly interrupts.

Johns: Wait a minute. I don't understand. Diaz was below
average on every question, but I was below average
on only two of the questions. I was above average
on two of the four. But she got a D and I got an
F. That doesn't make sense.

Teacher 1: I'm sorry, the numbers don't lie. And, I'm sorry,
but our time is up.

2. Teacher 2


Regrettably, essay type exams are not the only kinds of exams 
that produce "weighting" problems such as those just described. 
Such problems also occur, perhaps even to a greater extent, when 
teachers administer exams that contain both essays and objective-type
questions. Consider the following. Assume that the following
figure contains the scores and grades that Teacher 2 gave on an
exam that contained several essay questions, worth a total of 75
points, and a set of objective-type questions, worth, together, 25
points. (The term "E-Pts" in this figure, obviously, stands for
essay points, and the term "O-Pts" stands for objective-question 
points.) This figure shows that Teacher 2, like most teachers who 
administer such exams, simply added up point totals and assigned 
grades accordingly. 

Figure 2

NAME     E-PTS   O-PTS    TTL    GRD
PHILL      62      25      87    A
CHENG      61      20      81    B
JONES      56      25      81    B
DIAZ       64      10      74    C
ADAMS      52      20      72    C
?           ?      20      70    C
PLURS      64       5      69    D+
JOHNS      62       5      67    D
WENDI      62       0      62    F
HOPPE      61       0      61    F

MEAN      59.3     13     72.3

(? = entry illegible in the source copy)




Imagine now that several of Teacher 2's students have appeared 
for routine grade conferences. Quickly, however, these grade 
conferences turn nasty. 



Diaz: Both Adams and I got C's. But we would like to see
how we did on the different parts of the exam.

Teacher 2: Sure, no problem. Diaz, you got 64 points on the
essays and 10 points on the objectives.

Diaz: How's that in relation to other people?

Teacher 2: Good question. Well, a 64 on the essays was quite
good. In fact, very good. That was the highest
score that I gave. (You shared it with one other
person.) But the 10 on the objectives was well
below average. That really pulled you down. So,
you got a C.

Diaz: Let me make sure I understand this. I got the
highest score that you gave on the essays. And
they were worth 75% of the grade. I still only got
a C for the course?

Teacher 2: That's correct. The total number of points you
got, 74, was just about average, so you got the
average grade, C.

Diaz: But how could I get the average number of points
when my essays, which were worth 75%, were the
best?

Teacher 2: I'm sorry, but the numbers don't lie.

Adams then takes over.

Adams: Can you tell me about my individual scores?

Teacher 2: Sure. You got 52 points on the essays. That was
substantially below average.

Adams: How did I do on the objectives?

Teacher 2: You got 20 points on the objectives. That was
quite good. The objectives, in other words, pulled you
up. Your total points, 72, put you right in the
middle. Thus, you got a C.

Diaz: [Cutting in] Wait a minute, wait a minute. I'm
confused. Let me get this straight. Adams did
very poorly on 75% of the test but very well on
25%. And I did very well on 75% and very poorly on
25%.

Teacher 2: That's correct.

Diaz: But we got the same grade. How's that possible? I
should have gotten a much better grade than him.

Teacher 2: I'm sorry. You two got just about the same total
of points. So I had to give you the same
grade. The numbers don't lie.

Diaz: But, wait,

Teacher 2: I don't have any more time. The numbers
simply do not lie.


Assume now that Diaz and Adams leave. But in walk Plurs,
Wendi and Hoppe.



Teacher 2: Let's get right to it. Plurs, you got 64 points on
the essays and 5 points on the objectives for a
total of 69. That total is very bad. So you got a
D+.

Plurs: What were those individual numbers like in
comparison to others?

Teacher 2: 64 was quite good, in fact, excellent. On the
essays you shared the highest score. 5 points on
the objectives, however, was quite poor. So your
total of 69 was, as I said, very poor. Frankly, a
D+ was a gift.

Plurs: I don't get it. I did excellent on 75% of the exam
and I got a D+?

Before the teacher can answer, Wendi cuts in.

Wendi: How did I do?

Teacher 2: Well, again you did well on the essays, namely 62
points. That was a bit above the average. But you
did terribly on the objectives, namely 0 points.
Your total, 62, was the second lowest. So, sadly,
I had to fail you.

Wendi: Wait a minute. How's that possible? I got above
average on 75% of the exam and I still failed.

Teacher 2: That's correct. The objectives killed you.

Wendi: But they were worth only 25%. That's a small part.

Teacher 2: I have to repeat. The numbers don't lie. You had
the second lowest overall point total. I had to
fail you.

Plurs jumps back in.

Plurs: Wendi's situation, in other words, is sort of like
mine. I did excellent on the essays -- which you
said were worth 75% -- and I still got a D+.

Hoppe then cuts off Plurs, anxious to find out what happened
to her.

Hoppe: You failed me also. What happened with me?

Teacher 2: Well, Hoppe, you failed because your point total
was very bad. In fact, that total, 61, was the
lowest anybody got.

Hoppe: How did those points break down?

Teacher 2: Well, you did pretty good on the essays, namely, 61
points. That was a bit above average. But you
blew the objectives. On them you got 0 points.

Hoppe: This sounds like what happened with Wendi. Above
average on 75% of the exam; very poor on 25%.

Teacher 2: Correct.

Plurs: [Speaking to Wendi and Hoppe] Well, you think you
have it bad. I did excellent work on the essays,
right at the top of the whole class. And I got a
D+.

Teacher 2: I'm sorry but I have to run off to a faculty
meeting. Just remember. Numbers cannot lie.



3. Combining Scores

The examples reveal, of course, that in fact numbers
do lie, at least when it comes to grades and grading. Obviously,
something went wrong as Teachers 1 and 2 combined the scores from
different parts of the exam, and assigned grades. But, what was it
that went wrong?



Experts in statistics and educational measurement know that
sets of numbers can be widely spread out or tremendously
compressed or anything in between. 6 Thus, for example, if many
of the students who take a test get very similar scores on one of
the questions, the scores for that question will be very
compressed. Conversely, if many of the students who take that test
get very different scores on another question, the scores for that
question will be widely spread out. Although experts
describe this sort of variability and compression in various
different ways, perhaps the most common way of describing it
involves use of the so-called "standard deviation." The standard
deviation of a set of numbers is, simply, a number that describes
how spread out or compressed the individual numbers in that set
are. The higher the standard deviation, the larger the spread,
and, of course, vice versa. (Note: Even people who know nothing
about statistics can easily calculate standard deviations on any
computerized spread sheet.)
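
As a minimal sketch of that calculation, assuming Python's standard `statistics` module stands in for the spread sheet mentioned above, the following compares two hypothetical sets of question scores. The numbers and the names `compressed_scores` and `spread_scores` are invented for illustration and are not the paper's data; `stdev` computes the sample standard deviation.

```python
from statistics import mean, stdev

# Hypothetical scores for ten students on two questions (illustrative only).
compressed_scores = [5, 4, 4, 5, 4, 5, 6, 5, 5, 5]   # nearly everyone scores alike
spread_scores = [2, 4, 7, 3, 9, 5, 6, 7, 9, 10]      # scores differ widely


def describe(scores):
    """Return (mean, sample standard deviation), each rounded to 2 decimals."""
    return round(mean(scores), 2), round(stdev(scores), 2)


compressed_summary = describe(compressed_scores)
spread_summary = describe(spread_scores)

print("compressed:", compressed_summary)
print("spread out:", spread_summary)
```

The second set has a similar mean but a much larger standard deviation, which is the property the discussion that follows turns on.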

Experts in statistics and educational measurement long ago
became aware of a counter-intuitive fact about sets of
numbers that have different standard deviations. If such sets are
simply added, these experts discovered, sets with high standard
deviations (sets that are widely spread out) inadvertently may end
up counting more than sets with low standard deviations (sets that
are tightly compressed).







A simple illustration shows why this counter-intuitive thing 
happens. Imagine that somebody has been asked to sort a group of 
people from large to small and has been told to factor the height
and weight of the people in the group equally while doing so.
Imagine also, however, that most of the people in the group turn
out to be quite similar in height but very, very different in 
weight. If this occurs, the similarity in height will cause height 
to be discounted as a sorting factor. Conversely, the differences 
in weight will cause weight to be exaggerated as a sorting factor. 

Not surprisingly, the same thing can happen with sets of 
scores. Assume, for example, that a teacher plans to count each of 
two essays on a test equally. 7 Assume also, however, that scores 
on the first essay range from 80 to 100 and the scores on the 
second essay range from 20 to 100. Finally, assume that one 
student has the highest score on the first essay, 100 points, and 
the lowest score on the second, 20 points. And assume that another 
student has the lowest score on the first essay, 80 points, and the 
highest score on the second, 100 points. Since the performance of 
these two students was exactly the same -- highest on one, lowest 
on the other -- the grades they get should be the same. But, 
addition reveals that the first student will get 120 points and the 
second 180 points. 
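
The arithmetic in this example can be checked with a short sketch. The scores and ranges are the ones given in the text; the helper `range_scaled_total`, which rescales each essay to a 0-to-1 scale over its observed range before adding, is only an assumption made here to show what "equal" treatment of the two essays would look like. It is not the remedy the paper itself proposes.

```python
# Scores from the example in the text, as (essay 1, essay 2).
student_one = (100, 20)   # highest on essay 1, lowest on essay 2
student_two = (80, 100)   # lowest on essay 1, highest on essay 2

# Observed (lowest, highest) score on each essay, as stated in the text.
ranges = [(80, 100), (20, 100)]


def raw_total(scores):
    """Simple addition, as the teacher in the example does."""
    return sum(scores)


def range_scaled_total(scores, score_ranges):
    """Add scores after rescaling each essay to 0..1 over its observed range.

    Rescaling by range is an illustrative assumption, not the paper's method.
    """
    return sum((s - lo) / (hi - lo) for s, (lo, hi) in zip(scores, score_ranges))


raw_one, raw_two = raw_total(student_one), raw_total(student_two)
scaled_one = range_scaled_total(student_one, ranges)
scaled_two = range_scaled_total(student_two, ranges)

print(raw_one, raw_two)        # raw addition favors the second student
print(scaled_one, scaled_two)  # rescaled totals treat the two alike
```

Under raw addition the 80-point spread of the second essay swamps the 20-point spread of the first, so the second essay in effect decides the ranking.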

In many grading situations, of course, this differential 
weighting phenomenon will have no real world impact. But, consider 
what might happen if an individual student did well on a question 
that ended up being under-weighted and poorly on a question that 
ended up being over-weighted. Or consider what might happen if
exactly the opposite happened for a different student. In these 
situations, major grading mistakes could occur. The first student 
described could easily get a significantly worse grade than 
deserved. And the second student could easily get a significantly 
better grade than deserved. 

Not surprisingly, this "double whammy" phenomenon is exactly
what happened to both Teacher 1 and Teacher 2. The scores on
Questions 1 and 4 on Teacher 1's test are tightly compressed, with
standard deviations of .82 and .63 respectively. Conversely, the
scores on Questions 2 and 3 on Teacher 1's exam are widely spread
out, with standard deviations of 2.72 and 2.94 respectively. This
means, of course, that when Teacher 1 added up the scores on her
exam she inadvertently gave too much weight to Questions 2 and 3,
and too little weight to Questions 1 and 4. The same thing
happened to Teacher 2. The essay scores on Teacher 2's exam were
tightly compressed, with a standard deviation of 5.06. Conversely,
the objective scores were widely spread out, with a standard
deviation of 10.06. Thus, when Teacher 2 added up point totals, he
ended up distinctly undervaluing the essays and distinctly
overvaluing the objectives.
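
One rough way to see the size of the distortion is to compare each part's share of the summed standard deviations with its intended weight. The sketch below uses the standard deviations reported above; the function `spread_shares` is a hypothetical helper, and the share-of-spread figure is only a crude proxy for effective weight (it ignores how scores on the parts correlate), not a calculation taken from the paper.

```python
# Standard deviations as reported in the text.
teacher_1_sd = {"Q1": 0.82, "Q2": 2.72, "Q3": 2.94, "Q4": 0.63}
teacher_2_sd = {"essays": 5.06, "objectives": 10.06}


def spread_shares(sd_by_part):
    """Each part's share of the summed standard deviations, in percent.

    A crude proxy for effective weight; correlations between parts are ignored.
    """
    total = sum(sd_by_part.values())
    return {part: round(100 * sd / total, 1) for part, sd in sd_by_part.items()}


teacher_1_shares = spread_shares(teacher_1_sd)
teacher_2_shares = spread_shares(teacher_2_sd)

print(teacher_1_shares)
print(teacher_2_shares)
```

On this proxy, Teacher 2's essays, nominally worth 75% of the exam, account for only about a third of the spread in the totals, and Teacher 1's "equally weighted" Questions 1 and 4 together account for well under half.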



Consider, for example, Smith's situation with Teacher 1. Smith
did very good work on Questions 1 and 4 but only average work on
Questions 2 and 3. Because of the double whammy phenomenon, his
point total significantly under-rated his performance on the exam.
Johns's situation with Teacher 1, however, was exactly the
opposite. Johns did very poor work on [illegible] rated his
performance. Consider also some of Teacher 2's students. Hoppe
and Wendi and Plurs, it should be recalled, all did average or
superior work on the essays -- worth 75% of the exam -- but
terrible work on the objectives -- worth 25% of the exam.
[illegible] the objectives inadvertently ended up counting much
more than 25% and the essays on that test ended up counting much
less than 75%.

[illegible] argue that questions that produce very compressed
sets of scores [illegible] play no role [illegible] the weighting
problems just described will [illegible]. In fact, these teachers
[illegible] ought to be valued more than poor questions, and if
total [illegible] thing that matters, then [illegible]:






    In response to your questions regarding the
    weight assigned to the different questions
    on this exam, I must, in all candor, say this.
    I simply do not know at the present time how
    much the different questions on the exam will
    end up being worth. Some of them may end up
    counting for a lot, and some of them may end up
    counting for very little. Or, they may all
    end up counting the same. I won't know the
    answer until I'm done grading. Further,
    since I do not now know how much the
    individual questions will be worth, I simply
    cannot now tell you how much time to spend on
    them. Perhaps you should divide your time
    equally. Or, perhaps you should blow off one
    or more of them and concentrate on the others.
    Oh, and good luck to you all.

Obviously, teachers cannot actually say things like this to 
their students. But, if they mean for point totals to be the only 
thing that counts, and if they mean for under-valuing and over- 
valuing to occur, then they must say this to their students. 

4. "Weighting" Solutions 

Fortunately, teachers need not go in front of their students 
and throw bombs like the one just described in order to avoid 
differential weighting problems. Rather, teachers can use either 
of two simple techniques. 8 First, teachers can employ the same 
curve in connection with the scores or grades that they give on all 
of the different parts of an exam. 9 Thus, for example, a teacher 
can assign letter grades to all of the parts of an exam in light 
of, say, the following curve -- A's = 20%, B's = 30%, C's = 30%, 
D's = 20%. Once teachers do this, they create sets of scores that 
are the same in terms of compression or variability. And, if sets 
have the same degree of compression or variability, the 
differential weighting phenomenon does not occur. 
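
The curving technique just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_curved_grades`, the student labels and the raw scores are all hypothetical, ties are broken by input order, and only the 20/30/30/20 curve comes from the text.

```python
def assign_curved_grades(scores, curve):
    """Rank students by score and hand out letters according to fixed shares.

    scores: dict mapping student -> raw score on ONE part of the exam.
    curve:  dict mapping letter -> share of the class (shares sum to 1.0),
            listed from the best grade to the worst.
    Ties are broken by input order, which is a simplification.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    letters = list(curve)
    grades, start, cumulative = {}, 0, 0.0
    for position, letter in enumerate(letters):
        cumulative += curve[letter]
        # The last letter absorbs any rounding leftovers.
        if position == len(letters) - 1:
            end = len(ranked)
        else:
            end = round(cumulative * len(ranked))
        for student in ranked[start:end]:
            grades[student] = letter
        start = end
    return grades


# The curve used as an example in the text.
CURVE = {"A": 0.20, "B": 0.30, "C": 0.30, "D": 0.20}

# Hypothetical raw scores on two parts of one exam (not from the paper):
# a tightly compressed essay part and a widely spread objective part.
essay_part = {"S1": 64, "S2": 63, "S3": 62, "S4": 61, "S5": 60,
              "S6": 59, "S7": 58, "S8": 57, "S9": 56, "S10": 55}
objective_part = {"S1": 5, "S2": 30, "S3": 10, "S4": 25, "S5": 0,
                  "S6": 20, "S7": 15, "S8": 35, "S9": 40, "S10": 45}

essay_grades = assign_curved_grades(essay_part, CURVE)
objective_grades = assign_curved_grades(objective_part, CURVE)
print(essay_grades)
print(objective_grades)
```

Because both parts are forced onto the same curve, each part ends up with the same spread of letter grades no matter how compressed or spread out its raw scores were.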

Regrettably, teachers who employ a curving procedure like this 
can encounter problems of a different kind. For one thing, this 
approach forces teachers into an extra grading step, a step wherein 
the teachers convert the raw scores into curved scores. Further, 
use of this technique places teachers in a grading straitjacket. 
Scores on different questions, after all, will not naturally fall 
into precisely the same pattern. 



Fortunately, another technique for solving the differential 
weighting problem also exists, a technique that can easily be used 
by any teacher who has access to, or whose secretary has access to, 
a computerized spread sheet. This technique involves the 
conversion of "raw" scores into "standardized" scores. (Raw scores 
are the actual scores that teachers give to students when exams are 
graded.) Once raw scores are standardized, they can be added up, 
subtracted, averaged, combined and the like without any fear of 
weighting distortions. 10 

Standardized scores -- the most common being the so-called 
"Z-scores" or "T-scores" -- describe the distance of a raw score from 
the mean score in a group of scores. 11 Thus, a Z-score of -2.14 
indicates that the raw score at issue is 2.14 standard deviations 
below the mean score in the group. (This, incidentally, is a very 
low score.) Conversely, a Z-score of +2.14 indicates that a 
particular raw score is 2.14 standard deviations above the mean 
score in the group of scores involved. Z-scores, which are the more 
basic kind of standardized scores, are created by subtracting the 
mean score in a set of scores from the raw score involved, and then 
dividing the resulting number by the standard deviation of the set 
of scores involved. 12 




$$
Z = \frac{\text{Raw Score} - \text{Mean Score of Group}}{\text{Standard Deviation of Group}}
$$
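
As a minimal sketch of this conversion, the following Python fragment standardizes a small set of raw scores. The helper name `z_scores` and the five raw scores are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption; a spread sheet's population standard deviation function would give slightly different values.

```python
from statistics import mean, stdev


def z_scores(raw_scores):
    """Convert raw scores to Z-scores: (raw - group mean) / group SD."""
    group_mean = mean(raw_scores)
    group_sd = stdev(raw_scores)  # sample standard deviation (n - 1), an assumption
    return [(raw - group_mean) / group_sd for raw in raw_scores]


# Hypothetical raw scores for one exam question (not from the paper).
raw = [50, 60, 70, 80, 90]
standardized = z_scores(raw)

for score, z in zip(raw, standardized):
    print(f"raw {score:3d} -> Z {z:+.2f}")
```

Whatever the original spread of the raw scores, the standardized set has a mean of 0 and a standard deviation of 1, which is why standardized scores from different questions can be combined without one question silently outweighing another.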

All of this again brings the analysis back to the grades of 
Teachers 1 and 2. The following figure shows the Z-scores, 
including total or average Z-scores, for Teacher 1's test. This 
figure also shows what happens when the pertinent grading data is 
sorted, not by point totals, as Teacher 1 originally sorted the 
data, but by Z-totals. Finally, this figure shows what would have 
happened had Teacher 1 assigned grades by Z-scores rather than by 
point scores, using precisely the same overall grading "curve." 

FIGURE 3 




Everything now makes a great deal more sense. Consider Smith. 
He, it should be recalled, did the best work in the class on two of 
the four questions, and about average on two others. Nobody else 
did such strong work on so much of the test. Point totals give him 
a B. Standardized scores give him an A. Which seems more 
appropriate? Or consider Johns. Admittedly, he did very poor 
work, however it be scored. But, point totals give him the lowest 
grade in the class, and standardized scores give him the second 
lowest. Which seems more appropriate? 

Z-score operations reveal similar problems -- albeit even more 
serious ones -- in connection with Teacher 2's grades. The 
following figure shows the Z-scores for the different parts of 
Teacher 2's exam for individual students. Further, this figure 
shows Z-score "totals." (Because Teacher 2 intended to assign 3 
times as much weight to her essays as to her objective-type 
questions, the Z-total reflects that kind of multiplication.) 
Finally, this figure shows all of the data sorted by Z-totals 
rather than by point totals. And, again, this figure shows what 
would have happened had Teacher 2 assigned letter grades, using the 
same curve, in light of Z-scores rather than point totals. 









Figure 4 

NAME    E-PTS   Z       O-PTS   Z       TTL    Z-TTL   P-GRD   Z-GRD 
PHILL   62       0.53   25       1.19   87      2.78   A       A 
DIAZ    64       0.93   10      -0.3    74      2.49   C       A 
PLURS   64       0.93    5      -0.8    69      1.99   D+      B 
CHENG   61       0.34   20       0.7    81      1.72   [?]     [?] 
JOHNS   62       0.53    5      -0.8    67      0.79   [?]     [?] 
WENDI   62       0.53    0      -1.29   62      0.3    F       [?] 
HOPPE   61       0.34    0      -1.29   61     -0.27   F       [?] 
JONES   55      -0.85   25       1.19   80     -1.36   [?]     [?] 
ADAMS   52      -1.44   20       0.7    72     -3.62   [?]     F 
SMITH   50      -1.84   20       0.7    70     -4.82   C       F 

MEAN    59.3            13              72.3 
S.D.    5.06            10.06 

([?] marks an entry that is illegible in the source.) 

Consider Plurs. Plurs had the highest score given on 75% of the 
exam (the essays) but he blew the objectives, worth 25%. Point 
totals give him a "D+". Z-scores, however, give him a "B". Which 
grade makes more sense? Or consider Diaz. Diaz had the highest 
score given on the essays, and did just below average work on the 
objectives. Should she get a "C" (point scores) or an "A" 
(Z-scores)? And consider the other end of the scale. Adams and 
Smith did very, very poorly on the essays but well above average on 
the objectives. Conversely, Wendi and Hoppe did above average on 
the essays and very, very poorly on the objectives. If F's have 
to be given at all, who should get them? Wendi and Hoppe (point 
totals)? Or Adams and Smith (Z-scores)? 
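
The comparison between point totals and Z-totals can be reproduced with a short Python sketch. The essay and objective scores below are entered as they can best be read from Figure 4, so they should be treated as an assumption rather than a certified transcription; the variable names and the use of the sample standard deviation are likewise illustrative choices. The triple weight on the essay Z-score follows the text.

```python
from statistics import mean, stdev

# (essay points, objective points) per student, as read from Figure 4.
scores = {
    "Phill": (62, 25), "Diaz": (64, 10), "Plurs": (64, 5), "Cheng": (61, 20),
    "Johns": (62, 5), "Wendi": (62, 0), "Hoppe": (61, 0), "Jones": (55, 25),
    "Adams": (52, 20), "Smith": (50, 20),
}
ESSAY_WEIGHT = 3  # essays were meant to count 3 times as much as objectives


def z_scores(values):
    """Standardize a list of scores using the sample standard deviation."""
    m, sd = mean(values), stdev(values)
    return [(v - m) / sd for v in values]


names = list(scores)
essay_z = z_scores([scores[n][0] for n in names])
objective_z = z_scores([scores[n][1] for n in names])

point_totals = {n: sum(scores[n]) for n in names}
z_totals = {n: ESSAY_WEIGHT * ez + oz
            for n, ez, oz in zip(names, essay_z, objective_z)}

by_points = sorted(names, key=point_totals.get, reverse=True)
by_z = sorted(names, key=z_totals.get, reverse=True)

print("Ranked by point totals:", by_points)
print("Ranked by Z-totals:    ", by_z)
```

Sorting the same ten students both ways shows who moves: the two lowest point totals and the two lowest Z-totals belong to different pairs of students.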

5. Real World Data 

What then about real world data? 

As noted at the outset here, 13 different sets of real grades 
were studied in connection with the preparation of this analysis. 
Of those 13, 10 involved combinations of scores. (Three sets of 
grades were from tests that were made up entirely of equally 
weighted objective-type questions.) Some of these 10, for example, 
involved combinations of scores on several different essays. Others 
involved combinations of scores on essays and scores on sets of 
objective-type questions. (And two of these 10 involved 
combinations of scores from different tests.) The results of 
Z-score analysis of these ten sets of grades are these: 

1. On 10 out of 10 of these sets of grades, Z-score 
calculations produced different rankings for students 
than point total calculations. In other words, on 10 out 
of 10 sets of grades, point totals presented distorted 
pictures of at least some students' performance on the 
exams. 

2. On 8 out of 10 of these sets of grades, Z-score 
calculations would have produced different grades for at 
least some students. In other words, in 8 out of 10 of 
these sets, students who actually did better [...] 

3. [...] relatively minor [...]. For 
example, a couple of students got B's when they should 
have gotten C+'s or B+'s. In several of these sets, 
however, major grade distortions occurred. 

[...] point just made. Professor B [...] in connection with the last 
[...] is a teacher at a [...] law 
school. Teacher B, like Teacher 1 in the foregoing examples, gave 
a test made up of several essay questions. As with Teacher 1, 
scores for some of the essays on Teacher B's exam were rather 
compressed, and scores on other essays were rather spread out. 
Despite that fact, however, Professor B simply totaled up the 
points. The following figure shows Professor B's actual point 
totals, and the actual grades that Professor B gave. The following 
figure also shows the results of Z-score "sorting." The student 
with the highest Z-score -- and hence the best actual performance 
on the exam -- is at the top of the list. Conversely, the student 
with the lowest Z-score -- and hence the worst actual performance 
on the exam -- is at the bottom of the list. (Note: Underlined 
grades in the following figure, incidentally, are grades that were 
"adjusted" for things like class participation and the like. [...] 
should be ignored in [...].) 






Figure 5 

[Table: I.D., PTS (point total), GRD (grade given), and Z-TTL 
(Z-score total) for each student on Professor B's exam, sorted from 
the highest to the lowest Z-score total. The individual entries are 
illegible in the source.] 




Note the discrepancies between point scores and standardized 
scores. Student 36 actually did the best work in the class. But 
she only got a B+. Five students who did poorer work got A's. 
Likewise, Student 68 got a C despite the fact that he did work that 
was roughly comparable to that done by students who generally got 
B's. Note particularly, however, Student 33. This student got a 
D+ for the course despite the fact that numerous students who did 
poorer work on the exam got C's and some who did poorer work even 
got C+'s. Conversely, note Student 21's incredible good luck. 
This student got a C despite the fact that his work was comparable 
to work that earned other students D+'s. 

Consider also another real world example. Professor F, like 
Teacher 2 in the foregoing analysis, administered a test that 
combined several essays and a set of objective-type questions. 
Scores on F's essays, like scores on Teacher 2's essays, tended to 
be very compressed. Conversely, scores on F's objectives, like 
scores on Teacher 2's objectives, were widely spread out. Like 
Teacher 2, Professor F simply totaled up point scores. The following 
figure shows the actual point totals that Professor F calculated, 
and the actual grades that she gave. This figure also shows, 
however, the results of Z-score calculations. The student with the 
highest Z-score -- hence the best performance on the exam -- is at 
the top of the list, and the student with the lowest Z-score -- 
hence the worst performance on the exam -- is at the bottom of the 
list. 

Figure 6 

[Table: I.D., PTS (point total), GRD (grade given), and Z-TTL 
(Z-score total) for each student on Professor F's exam, sorted from 
the highest to the lowest Z-score total. The individual entries are 
illegible in the source.] 



Note the discrepancies. Student 61 got a B despite the fact 
that several students who did poorer work on the exam got A's and 
B+'s. Student 69 had better luck. He got a B though his work was 
roughly comparable to that of students who got C's. Likewise, 
Student 27 caught quite a break. He got a C despite the fact that 
his work was comparable to that done by students who generally got 
D's. Note finally the extraordinarily bad luck of Students [...]. 
Both of these students failed the course despite the fact 
that students who did poorer work obtained passing grades. 

6. The Bar Examination 

[...] grading [...] considerable interest to students, namely, 
the bar examination. 

No one could dispute that bar examinations play an 
extraordinarily important role in the lives of law students and 
[...]. People who cannot pass bar examinations generally cannot 
practice law. Given that fact, the possible existence of 
measurement error in the bar exam is a topic of considerable 
interest. 

Fortunately, literature addressing bar exam issues indicates 
that bar examiners are aware of the weighting problems just 
described. Stephen Klein, for example, who has written extensively 
about measurement issues and the bar exam, addresses this weighting 
problem at considerable length. 13 Klein first notes that weighting 
problems can occur when the scores for the different essays on the 
essay portion of the exam are added up. This is comparable, of 
course, to Teacher 1's situation. Weighting problems can occur in 
this context, Klein notes, because the compression or variability 
of the scores on the different essays -- technically, the standard 
deviations -- can be very, very different. Klein notes in this 
context, incidentally, that these problems are "not trivial." 14 
[...] continues, "essay questions whose standard 
deviations were [...] three times as great as the standard 
deviations of other questions on the same examination." 15 In other 
words, Klein [...] on the essay portions of bar examinations 
that are just as bad as the problems described herein. 

The second area in which Klein notes that weighting problems 
can occur is roughly comparable to what happened to Teacher 2 
herein. On bar examinations, essay portions of those exams must be 
[...] portions. 16 As soon as such combining is 
[...], however, weighting issues arise. Fortunately, the 
means [...] to avoid this particular problem need not 
be [...] here. Rather, the only thing that needs to be said 
here is that bar examiners do not seem to make weighting mistakes 
[...] combine scores from essay questions and objective-type 
questions. 






These references to bar exam practice raise a final point. If 
bar examiners can do the things necessary to eliminate weighting 
problems, why cannot law school teachers also do those things? 
Admittedly, the score that students receive on the bar exam is much 
more important than the score that students get on any individual 
law school test. Nevertheless, real similarity exists between 
these situations. In both, a single test is used to make an 
important educational decision about an individual person. 

B. The "Reliability" of Tests 

Regrettably, weighting errors such as the ones just described 
are not the only kinds of measurement errors that law school 
teachers probably make in connection with law school exams. Most 
law school teachers seem to know little or nothing about the 
measurement concept of "reliability." 17 Thus, not surprisingly, 
many such teachers almost certainly make reliability errors. 

1. Teacher 3 

The figure below contains the scores that another hypothetical 
teacher, "Teacher 3," gave on an exam that consisted of six equally 
weighted essay questions. 

Figure 7 

[Table: NAME, scores on questions #1 through #6, TTL (point total) 
and GRD (grade) for each of Teacher 3's students. The individual 
question scores and question means are illegible in the source; the 
recoverable entries follow.] 

NAME    TTL    GRD 
JONES   42     A 
ADAMS   42     A 
WENDI   42     A 
PLURS   38     B+ 
HOPPE   37     B 
PHILL   36     C+ 
CHENG   36     C+ 
SMITH   31     C 
DIAZ    30     C 
JOHNS   29     D- 

        #1     #2     #3     #4     #5     #6     TTL 
MEAN    [?]    [?]    [?]    [?]    [?]    [?]    36.3 
S.D.    2.15   1.58   2.15   1.55   1.62   1.89   4.97 






A quick look at these scores and grades reveals that the 
standard deviations for the scores for the different questions are 
somewhat different. In other words, scores on some of the 
questions were more compressed than scores on other questions. 
Thus, "weighting" problems such as those already described may well 
have occurred. This quick look at these scores and grades, 
however, almost certainly does not reveal the existence of another 
serious measurement error problem. But, such a problem exists. 
And it is a problem that is in many ways much more serious than the 
weighting problem. 



Another grade conference from Hell introduces this problem. Diaz, 
Jones, Phill and Smith appear for a conference with Teacher 3. 

Teacher 3: Well, let's get right to it. You three will recall 
that the test had six equally weighted questions, each worth a 
maximum of 10 points. They were, in my opinion, equally difficult. 
And they dealt with equally important areas of the law. 

Diaz: (Impatient) How did I do? 

Teacher 3: Well, Diaz, you got a total of 30 points. Way below 
average. In fact, you're lucky you got a C and not a D. 

Diaz: Well, how did I do on the individual questions? 

Teacher 3: I guess the best way to describe it is to say you 
really ran hot and cold. On about half of the questions you did 
very poor work. And on about half you did pretty good work. Very 
inconsistent. 

Diaz: You know, that doesn't surprise me. I just kept losing my 
concentration during the exam. 

Teacher 3: Well, that explains it. 

Diaz: Well, don't you think that the school should do something 
about the air-conditioning in that room? 

Teacher 3: What do you mean? 

Diaz: You mean, you don't know about the room? That room is 
unbelievable. When the air conditioner comes on, it seems like the 
North Pole. When the machine shuts down, it's like the Gobi desert. 

Teacher 3: I didn't notice that at the front of the room. 

Diaz: Not everybody can sit at the front of the room, Teacher. 
And, some people maybe had different clothes than I. My friend, for 
example, had a sweater and a light shirt. She just kept changing. 
I didn't think of that. I just had on a heavy sweatshirt. I 
suppose, I could have done like that student at Berkeley, the 
"naked guy." But, frankly, I didn't really think of that during the 
exam. 

Jones: (Cutting in) Well, the temperature was fine where I was in 
the room. How did I do, Teacher? 

Teacher 3: You got 42 points, Jones. That's a tie for the highest 
given. That's why you got an A. 

Jones: How did I do on the individual questions? 

Teacher 3: Well, let's see. You also ran hot and cold too. A 
couple of your scores were really high. And a couple were pretty 
low. Real inconsistent. 

Jones: That's funny. 

Teacher 3: What do you mean? 

Jones: Well, our study group scheduled reviews for a couple of 
successive nights right before the exam. We devoted different 
nights to different topics. I missed about half of those sessions, 
the ones, I bet, that covered the stuff that threw me on the exam. 
Wow, what a lucky break that I did not miss another one, the one 
that dealt with the topic that you addressed in two questions. 

Phill: [Interrupting] And me? How did I do? 

Teacher 3: Well, Phill, you were just about in the middle of the 
pack total-wise, with 36 points. That put you in the middle of the 
pack. Hence, a C+. On the individual questions? You ran hot and 
cold. And warm. In fact, you got scores all over the map. 

Phill: What about my handwriting? 

Teacher 3: What are you talking about? 

Phill: Well, I've been told that my handwriting is hard to read. 
So, I try to print answers as much as possible in the blue books. 
But, a lot of times I forget and slip back into using handwriting. 

Teacher 3: Regardless of what you may be thinking, I pay 
absolutely no attention to handwriting when I grade exams. 
Handwriting plays no role in the grading. 

Smith: [Cutting in] You know, Teacher 3, that reminds me of 
something that I wanted to ask. Do you mind? 

Teacher 3: Well, go ahead. We'll see. 

Smith: You know how rumors go around, and I don't like to give 
them any credence, but let me ask you this. I've heard that you got 
a call from the Harvard Law Review while you were grading the 
exams, a call accepting one of your papers. 

Teacher 3: Yah, that's part of what happened. 

Smith: I've also heard, however, that two days after that the 
Harvard people called back and said that they'd made a mistake. 
They'd mixed up your article with somebody else's and they really 
meant to publish that other one. 

Teacher 3: Yep. You cannot imagine how mad I was. I was 
tremendously excited for a couple days, and then absolutely heart 
broken. 

Smith: Well, my question is this. When were you grading my paper? 
Were you grading it when you got the first call, or when you got 
the second call? 

Teacher 3: Don't be ridiculous. Those sorts of things play no 
role whatsoever in grading. 

2. Consistency in Measurement 

Anybody who has tried to measure anything knows that the 
process of measurement itself often produces error. A trusted 
scale, for example, might show that we have achieved -- or 
exceeded -- [...] weight. Five seconds later, however, when we 
step back on that scale for confirmation, the scale displays a 
different weight, a pound or two higher or lower. Obviously, 
weight was not gained or lost in those few seconds. Rather, the 
scale produced error in measurement. Likewise, consider 
what might happen if we used an old, old watch to time class 
sessions. On warm days, the watch might run a little fast and on 
[...] a little slow. Thus, though we might intend the class 
always to be 55 minutes long, sometimes it actually runs for 54 
[...] for 56. Again, what has happened here is 
clear. The process of measurement itself produced error. 

The same thing occurs in connection with educational 
[...]. Tests themselves produce error. A teacher who allows 
handwriting to influence her judgment on essay answers, for 
example, introduces error into the measurement process itself. 
[...] if a particular student just happened not to study a 
[...] teacher tests heavily, chance will play a significant 
role in the score that that student gets on a test. And chance, of 
course, is just another word for measurement error. Further, if 
conditions [...] administration of a test -- air conditioning, for 
example -- [...] a role in the performance of some students, or 
many, error will necessarily infect test scores. 









Not surprisingly, measurement experts have developed methods 
for measuring the amount of error that a measuring instrument 
itself produces. Measuring instruments that produce very little 
error are said to be high in "reliability." They then get number 
ratings that approach 1.00. (1.00 indicates perfect reliability.) 
Conversely, measuring instruments that produce a lot of error are 
said to be low in reliability and get number ratings that approach 
0.00. (0.00 indicates that an instrument measures nothing but 
error or chance.) 

Consider reliability ratings for two different clocks. The 
old watch might have a reliability rating of, say, .8. This means, 
roughly, that this instrument pretty much of the time produces 
measurements that are pretty much alike. Conversely, a $10 million 
clock that scientists use in connection with particle 
physics experiments might have a reliability rating of, say, 
.99999999. This means that this instrument virtually always 
produces measurements that are virtually identical. 

Educational statisticians have developed a number of ways to 
measure the reliability of tests. Though the technical reasons 
why these techniques work are beyond the scope of the present 
analysis, the techniques themselves -- at least some of them -- are 
relatively easy for non-experts to use. Teachers who wish to 
calculate the reliability of essay tests, for example, can simply 
plug the pertinent numbers from the tests that they give into the 
"co-efficient alpha" formula. 18 When the calculations are 
completed, the reliability of that test is described. 

$$
\text{Reliability} = \frac{\text{No. of Questions}}{\text{No. of Questions} - 1} \times \left( 1 - \frac{\text{Sum of Individual Item Variances}}{\text{Variance of Total Scores}} \right)
$$

At first glance, this formula looks daunting. In fact, however, 
teachers who have access to a computerized spread sheet can very, 
very quickly learn how to use it. Teachers using this formula 
divide the number of essay questions on the test by (the number of 
essay questions on the test minus 1). Then teachers must 
multiply the resulting number by a number that is 1 minus (the sum 
of the "variances" of the scores of the different questions on the 
test divided by the variance of the total scores on the test). 
Spread sheets, incidentally, can instantaneously calculate the 
variance of a set of numbers. 
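
For readers who prefer a script to a spread sheet, the calculation can be sketched in Python as follows. The function names are hypothetical, the sample variance (n - 1) is an assumption, and the standard deviations in the example are entered as they can best be read from Figure 7, so the result is illustrative rather than authoritative.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Co-efficient alpha from raw data.

    item_scores: one list per question, each holding every student's
    score on that question, in the same student order.
    """
    k = len(item_scores)
    totals = [sum(per_student) for per_student in zip(*item_scores)]
    item_variance_sum = sum(variance(question) for question in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def alpha_from_summary(item_sds, total_sd):
    """Same formula, starting from standard deviations instead of raw scores."""
    k = len(item_sds)
    item_variance_sum = sum(sd ** 2 for sd in item_sds)
    return k / (k - 1) * (1 - item_variance_sum / total_sd ** 2)


# Standard deviations for the six questions and for the totals,
# as they appear to read in Figure 7 (an assumption about a garbled table).
figure7_item_sds = [2.15, 1.58, 2.15, 1.55, 1.62, 1.89]
figure7_total_sd = 4.97

teacher3_alpha = alpha_from_summary(figure7_item_sds, figure7_total_sd)
print(round(teacher3_alpha, 2))
```

A variance is simply a standard deviation squared, which is why the second helper can work from the S.D. row of a grade sheet when the individual question scores are not at hand.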

The easiest-to-use formula for calculating the reliability of 
objective-type tests is called the "Kuder-Richardson Formula 21." 

$$
\text{Reliability} = \frac{\text{No. of Items}}{\text{No. of Items} - 1} \times \left( 1 - \frac{\text{Test Mean} \times (\text{No. of Items} - \text{Test Mean})}{\text{No. of Items} \times \text{Variance of Test Scores}} \right)
$$

At least superficially, this formula looks daunting. But 
teachers who have access to spread sheets can quickly master 
its use. [...] questions on the test [...] 
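
A minimal Python sketch of the Kuder-Richardson Formula 21 calculation follows. The function names and the example figures (a 50-item test with a mean of 35 and a variance of 64) are hypothetical and chosen only to show the arithmetic; the sample variance is an assumption.

```python
from statistics import mean, variance


def kr21(num_items, test_mean, test_variance):
    """Kuder-Richardson Formula 21 for a test of right/wrong items."""
    k = num_items
    return k / (k - 1) * (1 - test_mean * (k - test_mean) / (k * test_variance))


def kr21_from_totals(num_items, total_scores):
    """Convenience wrapper: compute the mean and variance from students' totals."""
    return kr21(num_items, mean(total_scores), variance(total_scores))


# Hypothetical objective test: 50 items, class mean 35, variance 64.
example_reliability = kr21(50, 35, 64)
print(round(example_reliability, 2))
```

Only three numbers are needed (the number of items, the mean and the variance of the total scores), which is what makes this formula easier to use than formulas that require item-by-item data.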

Once reliability ratings for tests are calculated, the rest of 
[...] regarding the degree of [...] measurement. 
[...] tentative standards [...]. If tests are going to be used to 
make decisions about groups of students, or if tests are going to 
be used as one of several factors contributing to a total score, 
then reliability ratings of around .50 and .60 are generally 
considered adequate. If tests are to be the only thing that is used 
to generate a score for [...], such tests should have reliability 
ratings of at least .85. Teacher 3, like most law teachers, [...] 
grades for individual students. Thus, her test ought to have a 
reliability rating of at least .85. 

Teacher 3's test nowhere near approaches that degree of 
reliability. Calculation of reliability for that test -- using the 
[...] -- produces a reliability rating of .21. This means, 
of course, that this test is very, very low in reliability. In 
other words, if the students who took this test were to take it 
again, and if no "learning effects" occurred, [...]. 
The reliability of the [...] 
the generally accepted figure for [...] 
sole determinant of individual students' grades, it also is 
[...] standards for tests that will make up only part of 
students' grades. 

A bizarre joke about grading sometimes makes the rounds in law 
schools. [...] throwing the blue books [...] 
upon which the blue books [...]. 
This "method" of grading has a reliability rating of .00. 
[...] is at work. But, as just noted, Teacher 3's test has [...]. 
Thus, not very much more than chance was involved in that test. 




3. Real World Data 


What then of the real world? As noted earlier, in connection 
with the preparation of the present analysis, 13 sets of grades 
were studied. Several comments about methodology, however, must 
initially be made. First, for some of the objective-only tests, 
reliability ratings were calculated using several different 
techniques, techniques that sometimes produce different ratings. 
All calculated ratings are listed. Second, technical problems made 
it impossible to calculate a reliability rating for two of the sets 
of grades submitted. Had it been possible to calculate reliability 
of these tests, however, technical reasons suggest that the ratings 
would have been quite low. Third, for technical reasons and 
because of insufficient data, the reliability of the separate parts 
of combination-type tests generally could not be 
calculated. Fourth, all of the teachers who administered 
combination-type tests here intended the total score on the 
objective-type questions to be the equivalent of a single score on 
the essay-type questions. Hence, when reliability for these 
combination-type tests was calculated, that same notion was 
employed. 




1. Objective-Type Questions Only: Single Test determined 
course grade: 

Teacher M: .59 (co-efficient alpha); .47 (split halves) 
Teacher L: .53 (co-efficient alpha); .37 (split halves) 
Teacher K: .73 (co-efficient alpha); .79 (split halves) 

2. Objective-Type Questions Only: Several Tests combined to 
generate course grade: 

Teacher J-#1: .75 (KR-20) 
Teacher J-#2: .72 (KR-20) 
Teacher J-#3: .75 (KR-20) 

3. Combination-Type Tests (Essays + Objectives): Single 
Test determined course grade: 

Teacher E: Impossible to calculate; technical 
           reasons suggest low reliability 
Teacher F: .4 (co-efficient alpha) 
Teacher H: .52 (co-efficient alpha) 
Teacher I: .46 (co-efficient alpha) 
Teacher G: Impossible to calculate; technical 
           reasons suggest low reliability 

4. Essay-Type Questions Only: Single Test determined course 
grade: 

Teacher B: .3 (co-efficient alpha) 
Teacher C: .44 (co-efficient alpha) 
Teacher D: .4 (co-efficient alpha) 
Teacher A: .85 (co-efficient alpha) 



One thing should now immediately be obvious. Although 13 sets
of grades were studied, only two of the teachers in this study
clearly met generally accepted standards regarding reliability.
And only one additional teacher came close. First, Teacher J
assigned grades in light of performance on three different tests,
each of which had a reliability rating of more than .70. Since
experts generally agree that tests that will be combined to
generate grades are sufficiently reliable if they have ratings of
around .50, Teacher J clearly met that standard. Second, Teacher A
assigned grades in light of performance on a single exam. But, that
single exam had a reliability rating of .85. Since experts
generally agree that tests that will be the sole determinant of
individual course grades should have reliability ratings of at
least .85, Teacher A assigned grades in light of a test that met
even that stringent standard. Finally, Teacher K assigned grades in
light of performance on a single test with reliability ratings of
.73 and .79, ratings that come close to that standard. As for the
other teachers, all assigned grades in light of tests that experts
would universally agree were insufficiently reliable.

4. The Bar Exam, Reprise

As noted earlier, one very important similarity exists between
the tests that law students take in their regular classes and the
bar exam. Both are used to make important educational decisions
about individual students. It is fitting, therefore, to conclude
this discussion of the reliability of law school classroom testing
with a brief comment about the bar exam.

Experts know that subjectivity in grading decreases the
reliability of tests. Thus, essay type tests -- which are graded
in a subjective fashion -- tend to be less reliable than objective
tests. This fact, therefore, calls into serious question the
reliability of the bar exam, or, better said, the reliability of
the essay portion of the bar exam. Since this portion of this exam
is graded subjectively, reliability is likely to be low.

Not surprisingly, early data regarding the bar exam supports
this notion. Klein describes, for example, a study of the
California bar examination that revealed that different graders of
the same exam answers agreed regarding whether those answers should
pass or fail only 6?% of the time. Thus, the inter-grader
reliability of scores for these essays was shockingly low. The same
study, however, produced even more shocking news. When the same
graders were asked to grade the same papers on different occasions,
those graders agreed with themselves only 75% of the time. Later
studies confirm the existence of this problem. For example, Klein
notes that analysis of essay examinations in three states revealed
reliability ratings in the low .70's. 23 (Experts generally agree,
it should be recalled, that reliability ratings should be in the
.85 range if tests are to be the sole determinant of important
matters for individual people.) And, Gorfinkle and Klein conducted
a study in which they wrote two different answers for the same
essay question. The answers were substantively identical. However,
one of the answers was significantly longer than the other. Trained
bar examiners were asked to grade these two answers and to ignore
things like spelling, length of answer and grammar. Nevertheless,
the bar examiners consistently gave the longer answers the better
score. 24

It hardly need be said that bar examiners have gone to great
lengths to reduce reliability problems with the bar exam. For one
thing, most bar examiners now use the Multi-State Bar Exam, an exam
that is graded objectively. Objectively graded exams, of course,
tend to be significantly more reliable than subjectively graded
exams. In addition, bar examiners now generally use very
sophisticated techniques in connection with the grading of
essay-type questions, techniques that produce surprisingly high
degrees of inter-grader consistency. 26 Obviously, techniques that
significantly decrease scorer variability significantly increase
test reliability because scorer variability is one of the big
problems on essay type tests. 27 Finally, most bar exams now have
two parts, an objective portion and an essay portion. Analysis
reveals that the performance of individuals on the different parts
of these exams is remarkably consistent. (Students who do well on
one part tend to do well on the other and vice versa.) This
consistency, in turn, suggests a high level of consistency in
measurement for the exam as a whole.

The short of it is this. Though bar exams perhaps at one time
suffered from serious reliability problems, a high likelihood
exists that most such exams now are reliable enough to use them as
the basis for making important decisions about individual students.
Further, and perhaps more significantly, the fact that bar
examiners have in recent years taken major steps to increase the
reliability of bar exams suggests that the notion of reliability is
not a purely academic exercise, something that classroom teachers
need not address.

The Standard Error of Measurement

Interestingly, the notion of reliability does something other
than help teachers (and schools) decide just how consistent the
results are likely to be of tests that teachers use to generate the
grade for a course that students will obtain. Reliability data also
plays a role in determining how much "error" must be taken into
account when scores from individual tests are evaluated. This idea
of accounting for error, in turn, is generally called the
"standard error of measurement" in a test. 28



1. Teacher A



Note quickly before the following "grade conference from Hell"
that a major difference exists between the following conference and
the ones presented earlier. The following grade conference rests on
data from a real teacher's course. In short, though the conference
itself described below is hypothetical, the grades discussed in it
are real.



Student 29: I really have only one question, Teacher A. I got
            84 points, and a B+. What was the cut off in point
            scores for A's? How many more points did I need to
            get an A?

Teacher A:  Well, let's see. Student 7 got 85 points and I
            gave her an A. So, you missed an A by 1 point.
            Too bad.

Student 37: What about me? I got 62 points and a C. What was
            the cut off for C+'s?

Teacher A:  Let me check. Well, I'll be darned. Student 15
            got 63 and a C+. You missed a C+ by a single
            point.

Student 37: Are you sure that a one point score difference
            between Student 15 and me justified a different
            letter grade for the course for us?

Teacher A:  Yes, of course. That's what the grades indicate.

Student 29: [Cutting in] Well, I'm sorry to be a pest. But,
            Student 7 made the law review because of that A,
            and I missed out on law review because of that B+.
            And you know how incredibly big a deal law review
            participation is. Are you sure that a single point
            difference reveals a real difference in our work?

Teacher A:  I'm sorry. You just caught a bad break.

Student 29: You know, it's funny that you say it that way. I
            think that is exactly what happened. I think that
            I just caught a bad break. I think that a single
            point difference on your test is just a function of
            luck. I think that such a point difference doesn't
            really indicate any difference in performance at
            all.

Teacher A:  Don't be ridiculous. The numbers don't lie.







2. Measurement Error



Except in the most extraordinary circumstances, measurement
instruments, including tests, contain at least some error. This is
so even if the instruments have relatively high reliability
ratings. Admittedly, the amount of error in an instrument with
high reliability is going to be substantially less than the amount
of error in an instrument with low reliability, but some error is
likely to exist even in reliable instruments.

Because experts in measurement know that most instruments,
including highly reliable ones, contain at least some error, such 
experts have developed a number of methods for determining the 
amount of error that exists in any instrument. This amount of 
error, in turn, is generally called the "standard error of 
measurement" in an instrument. In effect, the standard error of 
measurement is the margin of error that must be considered when 
measurements are evaluated. 

The work of public opinion pollsters provides perhaps the best
known example of this concept at work. Pollsters know that
measurements of public opinion always contain some error. Thus,
when pollsters do their work, they first calculate the likely
amount of error in their polls. Then, when they report results,
they also report that margin of error. A pollster might report,
for example, that 47% of the people in the United States
approve of the President's actions in a certain context. This
pollster might also report in this context, however, that a margin
of error of 3 points exists in the poll used. What the pollster
means, of course, is this. The reported approval rating of 47% is
not necessarily the "true" approval rating. Rather, the true
rating is likely to be in a "band" of numbers that extends
from three points above the reported rating to three points below
the reported rating. In other words, the "true" approval rating in
this instance could be as high as 50% or as low as 44%.

Educational measurement experts do a similar thing when they
report scores on tests. First, they determine the amount of error
in a test itself. Then when they report scores on that test, they
also report the margin of error in the test. A testing expert, for
example, might report that a particular student obtained a score of
47 on a test. That expert might also report, however, that the
test itself has a measurement error of 3 points. What this would
mean, of course, is this. The reported score of 47 points is not
necessarily the student's "true" score at all. Rather, the true
score is likely to be in a band of numbers that extends from three
points above the reported score to three points below the reported
score. In other words, in the present example, the student's true
score could be as high as 50 or as low as 44.

As with many other aspects of educational measurement, no hard
and fast rules exist regarding the use that must be made of the
standard error of measurement notion. Nevertheless, educational
measurement experts generally agree on one thing. Reported scores
cannot be considered actually different from each other unless the
error bands around the different scores do not overlap. 29 In
other words, scores are only truly different if the plus side of
one person's error band does not overlap with the minus side of
another person's error band. In short, differences in scores of
less than twice the standard error of measurement of a test cannot
be assumed to reveal differences in performance. 30
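The band notion just described can be sketched in a few lines of code. The sketch below is only an illustration: the function names are hypothetical, and the treatment of scores exactly twice the standard error apart as "different" simply follows the "less than twice" wording above.

```python
def error_band(reported_score, sem):
    """Return the (low, high) band around a reported score."""
    return (reported_score - sem, reported_score + sem)


def scores_differ(score_a, score_b, sem):
    """True only if the two scores are at least twice the standard
    error of measurement apart, so that their bands do not overlap."""
    return abs(score_a - score_b) >= 2 * sem


# The example from the text: a reported score of 47 with an error of 3.
print(error_band(47, 3))         # (44, 50)
print(scores_differ(47, 50, 3))  # False: only 3 points apart
print(scores_differ(47, 53, 3))  # True: 6 points apart
```

On this reading, two scores on a test with a 3 point error must be at least 6 points apart before a difference in performance can be assumed.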

The standard error of measurement for a test can easily be
calculated, at least once the reliability of the test has been
calculated. Indeed, the formula is perfectly straightforward.

$$\text{Error} = \sqrt{\text{Variance of Total Scores} \times (1 - \text{Reliability})}$$

In words, the standard error of measurement of a test is the square
root of the variance of the total scores on the test times (1 minus
the reliability of the test). Again, recall that spreadsheets can
instantaneously calculate the variance of a set of scores.
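For readers who prefer code to spreadsheets, the formula can be sketched as follows. The scores and the reliability rating below are invented for illustration, and the use of population variance (rather than sample variance) is an assumption of this sketch, since the text does not say which was used.

```python
import math
from statistics import pvariance


def standard_error_of_measurement(total_scores, reliability):
    """Square root of (variance of total scores * (1 - reliability))."""
    return math.sqrt(pvariance(total_scores) * (1 - reliability))


# Hypothetical class of five students and a hypothetical rating of .82.
total_scores = [40, 50, 60, 70, 80]   # variance of these scores is 200
sem = standard_error_of_measurement(total_scores, 0.82)
print(round(sem, 2))  # 6.0
```

Note how the two inputs pull in opposite directions: a wider spread of scores raises the error, while a higher reliability rating lowers it.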

All of this brings this analysis back to Teacher A's exam, a
real exam it should be recalled, given to real students, generating
real grades. Calculation establishes that the standard error of
measurement in that exam is 6 points. Consider, therefore, Teacher
A's grades and scores.




Figure 8

[Table of Teacher A's scores and grades, listing for each student the student number (STU), the points obtained (PTS), and the letter grade assigned (GRD), in descending order of points from 91 points (A) to 25 points (F).]

Teacher A, it should now be clear, does not deserve his name.
Consider, for example, Student 7, the student who received the
lowest A that this teacher gave. First, measurement experts would
agree that 6 points must be subtracted from this student's reported
score of 85 to account for the negative part of his own error band.
Then, these experts would agree that another six points must be
subtracted from this reported score to account for the positive
part of the error bands of students who obtained lower reported
scores. In short, twelve points must be subtracted from Student
7's reported score of 85 before any kind of assurance can be had
that differences in performance really exist. Consider, however,
what Teacher A actually did. Student 29, with a reported score
only one point lower than 7's reported score, got a B+. This,
unquestionably, was a mistake. Further, Students 2 and 22, with
reported scores just 11 points lower than 7's, got B's. This may
well have been a mistake.






And consider what happened at the bottom of Teacher A's class. 
Student 12, with 33 points, got the highest F that Teacher A gave 
and Student 40, with 36 points, got the lowest D that this teacher 
gave. Error measurement analysis reveals, however, that assignment 
of different grades to these different students simply was not 
justified. Differences in reported scores of 3 points simply 
cannot be considered to reveal differences in actual performance. 
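The grade boundaries discussed in this section can be checked mechanically. The sketch below uses only the student numbers, scores and grades quoted in the text, together with the 6 point standard error reported for Teacher A's exam; the function name and the data layout are hypothetical choices made for this illustration.

```python
SEM = 6  # standard error of measurement reported for Teacher A's exam

# (higher student, score, grade, lower student, score, grade)
boundary_pairs = [
    ("Student 7", 85, "A", "Student 29", 84, "B+"),
    ("Student 15", 63, "C+", "Student 37", 62, "C"),
    ("Student 40", 36, "D", "Student 12", 33, "F"),
]


def unsupported_distinctions(pairs, sem):
    """Return the pairs whose score gap is less than twice the
    standard error of measurement."""
    return [pair for pair in pairs if pair[1] - pair[4] < 2 * sem]


for high, high_pts, high_grade, low, low_pts, low_grade in \
        unsupported_distinctions(boundary_pairs, SEM):
    print(f"{high} ({high_grade}) vs {low} ({low_grade}): "
          f"gap of {high_pts - low_pts}, needs {2 * SEM}")
```

Every one of the three pairs is flagged, since none of the gaps comes anywhere near the twelve points that this analysis requires.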

3. Real World Data 

What then of real world data? 

Recall again that in connection with the preparation of the 
present analysis, 13 sets of real grades were examined. Standard 
error of measurement analysis demonstrated a startling fact. Every 
single one of those sets displayed serious standard error problems.
In other words, every set of grades examined showed that teachers 
gave different letter grades to students even though the test 
performances of those different students simply could not 
realistically be considered different. 

Consider, just by way of additional example, the scores and 
grades of Professor C. C gave an exam made up of a series of 
essay- type questions to approximately 50 students. The top scoring 
student got 162 points on this test, and the bottom student got 120 
points. And the rest of the students got just about every number 
of points in between those two extremes. The standard error of 
measurement for this test was 8 points. In other words, on this 
test students' "true" scores could be anywhere between a number 8 
points above their reported scores and a number 8 points below 
their reported scores. But, C repeatedly gave different grades to 
students whose reported scores differed by only one or two points. 

An extraordinarily important point must now be made. The
standard law school examination system -- a single final exam made
up predominantly of essay-type questions -- almost necessarily will
produce standard error problems. This is so, of course, because
that system almost inevitably will involve use of tests that
contain fairly large error components. And it is so because this
system will tend to produce sets of scores that move in small
increments from top scores to bottom scores. These facts suggest,
in turn, that the traditional system almost necessarily will cause
many law school teachers to make serious measurement errors in
connection with the grades they assign.

Sadly, no easy solutions exist to this problem. Two possible
approaches, however, come to mind. First, law teachers could move
to a grading system that involved multiple components, several
tests, for example, or several tests and papers and quizzes. Since
measurement errors tend to cancel themselves out when multiple
components make up grades, this approach would essentially solve
standard error of measurement problems. Second, for teachers who
do not wish to use multiple components -- and, frankly, few law
teachers are likely to move to that approach -- error band
calculations could be built into the grading scale itself.
For example, teachers could openly admit to their students that
half letter grade differences in scores almost certainly do not
reflect real differences in performance. A B+, these teachers
might candidly admit, could just as easily have been an A or a B.
Once teachers made these admissions, they could simply make sure
that appropriate error bands separated students with full letter
grade differences in scores. And, these teachers could just let
the fickle hand of luck cancel out standard error problems for the
entire curriculum. "You might have gotten a half letter grade
lower in my class than you deserved," these teachers might tell
disgruntled students. "But, chances are that in other classes you
got a half letter grade better than you deserved."
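The claim that errors tend to cancel out across multiple components can be given a rough numerical form. The sketch below rests on an assumption that the text does not state: that the errors in the separate components are independent of one another. Under that assumption the error in an averaged grade shrinks with the square root of the number of components. The function names and the figures are hypothetical.

```python
import math


def sem_of_sum(component_sems):
    """Standard error of a total made by adding component scores,
    assuming the components' errors are independent."""
    return math.sqrt(sum(sem ** 2 for sem in component_sems))


def sem_of_mean(component_sems):
    """Standard error of the average of the component scores."""
    return sem_of_sum(component_sems) / len(component_sems)


# Hypothetical course graded on one exam versus four comparable exams,
# each with a standard error of measurement of 6 points.
print(sem_of_mean([6]))           # 6.0
print(sem_of_mean([6, 6, 6, 6]))  # 3.0
```

In this illustration four components halve the error rather than eliminate it, so the error band narrows but does not vanish.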

4. Legal Implications of Standard Error Issues

Interestingly, standard error of measurement problems have not
for the most part been explicitly addressed in connection with the
bar exam. Thus, sadly, no direct data from that exam can be added
to this discussion of law school grading. Nevertheless, several
points can yet be made. First, since the reliability of both the
essay portion and the objective portion of the bar exam seems to be
high, standard error problems should not be much of an issue
on the bar exam. The standard error of measurement in a test, it
should be recalled, is inversely related to the reliability of
that test. In other words, if a test is high in reliability, its
standard error of measurement is likely to be low. Second, as
noted repeatedly, when testers combine scores from different exams
measurement errors tend to cancel themselves out. This is so, of
course, because good luck on one exam is likely to be balanced by
bad luck on another. Since the bar exam in most jurisdictions is
made up of what clearly are two different exams -- the essay
portion and the objective portion -- overall scores on that exam
are less likely to be affected by error problems than scores on
single exams.






One last point must yet be made in this context. Although law
school teachers for the most part seem to know nothing about the
standard error of measurement, and although bar examiners do not
seem regularly to discuss this issue, the standard error of
measurement is not simply an educational abstraction, something
that nobody other than statisticians addresses. Rather,
measurement error is a concept that has real world vitality, even
in the courts. Craik v. Minnesota State University Board, for
example, contains an extended discussion of this notion generally,
an explanation of how it works. 31

A much more significant case in this context is Georgia State
Conference of Branches of NAACP v. State of Georgia. 32 This case
involved a civil rights action brought by a group of
African-American school children. One aspect of that case involved
the method used to assign children to "special" education classes.
Not surprisingly, that method included the use of an IQ test.
Students who scored below a certain point on that test, and who fit
other criteria, were assigned to special classes. Conversely,
children who scored above that point, and who otherwise qualified,
went to regular classes. The question then arose as to how precise
scores on that IQ test were. Not very precise at all, it turns out,
as the district court ruled. The appellate court affirmed in
language that seems to have great applicability in the present
circumstances.

The [district] court's construction of the I.Q. score
regulation was based on the factual finding that
including a standard error of measurement is sound and
that the range suggested by the AAMD [American
Association on Mental Deficiency] is professionally
desirable. Although the state regulation does not
explicitly refer to the standard error of measurement, a
number of experts testified at the trial that inclusion
of this amount of flexibility in considering I.Q. scores
is necessary for a meaningful cutoff. See e.g. Record,
Vol. 49 at 2126-28, 2135 (testimony of Dr. Kicklighter)
(standard error of measurement is an intrinsic part of
I.Q. test). Furthermore, there is substantial evidence
in the record supporting the view that the AAMD
guidelines are acceptable professional tools.

Again, the overall point is clear. Measurement notions that
most law school teachers might think of as obscure technicalities
in fact have a real world, judicial world existence. Real people's
lives are affected by these notions, and the courts are aware of
that fact. Or, better said, at least in some contexts courts are
aware of that fact.



III. Grading Decisions and the Courts

If the empirical data just described presents a picture that
is at all representative, and if the empirical data discussed earlier
also presents a representative picture, then a substantial
likelihood exists that many law school teachers make a number of
serious measurement errors in connection with the grading of law
school exams and the assigning of letter grades pursuant to scores
received on those exams. The question thus arises as to what, if
anything, law students can do about these errors. Particularly,
the question arises as to whether law students might find a
sympathetic ear in court.



Two separate but parallel sets of cases and ideas address the
legal rights of students who disagree with grading decisions. One
of those sets, a set that is very well known to teachers, deals
with the notion of "academic challenges." The other set of cases,
a set that seems to be unknown to most teachers, involves what is
sometimes referred to as "high stakes" testing. Interestingly,
these two sets of cases seem to point in opposite directions when
it comes to students' rights.

A. "Academic Challenge" Cases 

"Susan M" is one of the most recent students to file what is 
sometimes called an "academic challenge" lawsuit. 33 Susan, like
other students who have filed these lawsuits, claimed that the
grades she received in individual classes should be changed, and an
expulsion decision made in light of classroom grades should be
expunged. Susan M flunked out of law school in the late 1980's.
Unfortunately for Susan, and students in comparable cases, a widely
accepted rule exists for dealing with this sort of situation. This
rule, described most vividly in the United States Supreme Court
case of Board of Curators, University of Missouri v. Horowitz, 34
states in no uncertain terms that teachers and schools
have an enormous amount of discretion when it comes to grading /
expulsion decisions. Indeed, this rule states that absent the most
extraordinary circumstances, students simply have no judicial
recourse whatsoever when it comes to grading / expulsion
disputes. 35

Seemingly, the Horowitz rule makes it pointless for students
to file academic challenge cases. And, frankly, the cases
preceding and following Horowitz confirm that idea. Thus, when it
comes to classroom grading / expulsion decisions, the protective
arm of discretion provides teachers and schools with almost
complete protection. Having said that, however, an important
caveat must be made. Two small but potentially significant
possible exceptions to the Horowitz rule seem to exist.

In its most recent discussion of academic challenge issues,
the Supreme Court specifically noted that at least one kind of
situation exists in which students might prevail in an academic
challenge case. The judgment of a teacher or a school can be
overturned, the Court noted in Regents of the University of
Michigan v. Ewing, if that judgment constitutes a "substantial
departure from accepted academic norms." 36 What the just quoted
phrase from Ewing means, of course, is not at all clear. Probably,
however, this phrase reflects the Court's concern that unprincipled
teachers and schools might try to use the cloak of grading
discretion to protect themselves from well-founded claims of
deprivation of civil rights. It is possible, for example, that an
unprincipled teacher or a group of unprincipled teachers might
dismiss an African-American student from school despite the fact
that his academic work is no worse than that of white students. 37
Likewise, it is possible that an unprincipled teacher might fail
a student who did acceptable academic work because that student
made controversial political statements. Since in both of these
situations, the teacher's grading judgment constitutes a
substantial departure from accepted academic norms, in both of
these situations an academic challenge might succeed.

The second possible exception to the Horowitz rule grows out
of Maitland v. Wayne State University Medical School, 38 a case that
is just about the only modern academic challenge case in which a
student prevailed. Several problems occurred in connection with
the "comprehensive" pass / fail test that Maitland took at the end
of his second year of medical school. First, proctors of this test
employed different procedures in the two different rooms in which
the test was administered. Second, some sort of "error... in the
grading process" initially occurred. When that error was
corrected, Maitland got 20 points more than his original score.
Third, the passing score for the "retake" exam was set at a higher
figure than the passing score on the original exam. (If the
passing score on the retake had been set at the same figure as the
passing score on the original exam, Maitland's retake performance
would have been a pass.)

Interestingly, something that herein is called "measurement 
error" played a critical role in the student's success in Maitland.
During deliberations regarding grades given on the exam just 
described, the Chair of the pertinent faculty committee asked for 
a statistical analysis of the scores obtained in the two different 
rooms. The Chair sought that analysis, of course, to determine 
whether the different procedures followed in the different rooms 
affected scores obtained on the exam. Unfortunately, however, the
Committee chose not to wait for the results of this statistical 
analysis before deciding Maitland's appeal. Thus, the committee 
ruled twice against Maitland prior to receipt of that analysis. 
This proved to be a fatal mistake. Though the statistical analysis 
ultimately showed that the different procedures used in the 
different rooms did not affect scores obtained, the court concluded 
that the Committee's failure to wait for the results of that 
analysis was educationally irresponsible. Hence, Maitland's 
challenge was accepted. 

Note carefully now an important point. The statistical issue 
just described was only one of the reasons that the student 
prevailed in Maitland. Equally important in that case was the fact
that the court also thought that Maitland's school had erred when 
it allowed people with lower scores than Maitland on the original 
test to retake the exam without filing formal appeals. This second 
reason for decision no longer holds up. Ewing, which was decided
after Maitland, categorically states that the fact that Ewing's 
school allowed students with lower grades and scores than Ewing to 
continue in school despite expelling Ewing made no difference. 
Schools and teachers, the Court noted in Ewing, could weigh all 
sorts of intangible things when making grading decisions and still 
not fear judicial intervention. 






The short of it is this. Given the fact that one of the 
grounds for decision in Maitland no longer is sound, it is entirely 
possible that Maitland itself would be decided differently now than 
it was before Ewing was decided. Nevertheless, the statistical 
analysis / measurement error point addressed in Maitland perhaps 
still is sound. Perhaps, in other words, the failure to consider 
statistical evidence of measurement error associated with tests is 
a "substantial departure from accepted academic norms." 

B. "High Stakes" Testing

As noted earlier, academic challenge cases generally involve 
complaints that students make about the grades that they receive 
from classroom teachers, or about expulsion decisions that 
university officials make in light of classroom grades. Academic 
challenges like this almost always fail. Maitland, however, did
not really involve classroom grades, nor an expulsion decision
based on classroom grades. Rather, Maitland involved the grade
received on a single, extraordinarily important test. Maitland,
therefore, is not really an "academic challenge" case. Rather, it
is something like a "high stakes" testing case. 39

Debra P, like Susan M, was negatively affected by a testing
decision. 40 Debra, however, was not concerned about classroom
grades, nor an expulsion decision. Rather, Debra was concerned
about a "minimum competency" test that the State of Florida had
decided had to be passed before high school diplomas could be
granted. When Debra failed this test, she sued, claiming, among
other things, that this test invidiously discriminated against
members of minority groups.



High stakes testing, which was what was involved in Debra P's
case, occurs when the score or grade that individuals obtain on a
single test has enormous consequences for the individuals involved.
The minimum competency test Debra P failed, of course, was a high
stakes test. Failure to pass it meant that students did not get
high school diplomas. The LSAT also is a high stakes test. Scores
that individuals obtain on this single test can have life-changing
impact. A very high score on this test, after all, might lead to
admission to an "elite" law school. Conversely, a slightly lower
score on this single test might limit admission to only a
"national" school. And, lower scores yet might limit admission to
"regional" or even "local" law schools. The LSAT, of course, is not
the only example of a high stakes test. In fact, countless high
stakes tests exist. Most tests associated with admission to
educational institutions, for example, are high stakes tests.
Thus, the SAT is a high stakes test, as is the GRE (Graduate Record
Examination) and the MCAT (Medical College Admission Test). Further,
licensing and certification exams can be high stakes tests. Bar
exam failure, after all, can have life-changing consequences, as
can failure of a teacher certification exam. Finally, tests that
individual employers might use to screen potential employees, or
tests that individual employers might use in connection with
promotion practices, can be high stakes tests. The scores that
people get on tests given to potential police or fire officers, for
example, can have a life-changing impact.

Countless people other than Debra P have filed law suits in
connection with high stakes testing. For example, teachers who
failed re-certification exams have repeatedly claimed in court
that the exams involved discriminated against members of minority
groups. These kinds of law suits, obviously, are high stakes
cases.41 Again by way of example, a group of female students
that lost out on a major scholarship claimed in court that the
SAT discriminates against female students.42 This again is a
high stakes testing case. In addition, people who have scored
poorly on employment examinations have repeatedly sued to set the
test results aside.43 Finally, students who have been assigned to
special educational programs, or not assigned to such programs
because of performance on individual tests, have often sued to set
the results of these tests aside. 44 Again, obviously, high stakes 
tests were involved in these cases. 9 

Interestingly, courts that have dealt with high stakes testing
claims -- including the United States Supreme Court46 -- have
done exactly the opposite of what courts have done in the academic
challenge cases. As noted earlier, in academic challenge cases,
courts routinely place the burden of proving substantial violations
of accepted educational norms squarely on the student
plaintiffs. Since student plaintiffs can virtually never do this,
plaintiffs almost always lose these cases.47 Conversely, in high
stakes testing situations, the courts routinely place the
burden of proving compliance with generally accepted educational
norms on the proponents of a test or scoring methodology. In other
words, in high stakes testing cases, the test-graders rather than
the test-takers have the burden of proof.






Many cases illustrate this point, with Debra P's being perhaps
the most important. Of particular interest in the present
context, however, is a 1981 case, Delgado v. McTighe.48 In this
case, the plaintiff insisted that the Multi-State Bar exam, which
the plaintiff had failed, could not be used to make important
decisions about students. In other words, in this case, the
plaintiff raised a high stakes testing issue. Once this was done,
the burden shifted to the testers. But the testers met the burden.
The bar exam, the court ruled, was sufficiently reliable for use in
a high stakes testing situation.

Although it would be too time consuming to attempt to describe
herein all of the things that courts have required high stakes
testers to do in order to show the soundness of the tests involved,
two quick summary points can be made. First, several
commentators have suggested in recent years that perhaps the best
thing that high stakes testers can do is follow the "Standards for 
Educational and Psychological Testing, " standards created by the 
American Psychological Association, the American Educational 
Research Association and the National Council on Measurement in 
Education. These standards reflect the best current wisdom 
regarding educational measurement issues. 49 Second, one 
commentator who has written extensively about the legal
implications of high stakes testing -- S. E. Phillips -- concluded
a recent essay with a list of recommendations for high stakes
testers. Several of these recommendations deserve quotation.
High stakes testers, Phillips insists, should engage in a process
that includes "collecting appropriate validity and reliability
evidence, constructing and evaluating tasks according to
professional standards, [and] setting passing scores based on
professionally-acceptable methodology." Later, Phillips notes
that high stakes testers should also "obtain (and follow) the
advice of a technical advisory committee composed of nationally
recognized experts in psychometrics who have had experience with
the particular testing application."

Note carefully now two important final points about high
stakes testing cases. First, since high stakes testing cases have
in the past invariably involved claims of civil rights
deprivations, it could perhaps be argued that the existence of
civil rights claims is a prerequisite to success in these cases. In
other words, it is possible that individuals can succeed in high
stakes testing cases only if those individuals can show two things.
First, successful claimants in these cases perhaps must be able to
show that a problem exists in connection with an educational
measurement issue. Second, successful claimants in these cases
perhaps must be able to show that that measurement problem leads to
a civil rights deprivation.



Careful analysis reveals that this in fact should not be the
case. Consider the following. Assume that a teenage prodigy,
Jeannie Genius, takes the SAT and scores 799 points out of 800.
Assume also, however, that Jeannie publishes a description of
her high school science project in the journal Science. In this
article, Jeannie demonstrates that the question that she missed on
the SAT was mis-scored. Her answer, which was marked wrong, was
actually the correct answer. In other words, in this article,
Jeannie demonstrates that the Educational Testing Service (ETS),
which drafts and scores the SAT, goofed. Finally, assume that ETS
has now dug in its heels and refuses to change Jeannie's score.
Thereafter, Jeannie sues. Question: Can Jeannie succeed only if
she proves a violation of her civil rights? Must she prove, for
example, that the SAT invidiously discriminates against geniuses?
Or can Jeannie win simply by proving that ETS was wrong? Surely
the latter. But if the latter, then high stakes testing cases
generally need not necessarily include assertions of civil rights
violations.







The point, of course, is this. Though claims of civil rights
deprivations might be useful in connection with high stakes testing
cases -- perhaps by helping claimants obtain statutory authority
for their claims -- the existence of educational measurement error
should by itself be sufficient to gain a hearing in high stakes
testing situations. Admittedly, deprivations of civil rights are
serious wrongs. But they are not the only kinds of wrongs that
exist.

The second thing that must yet be said about high stakes 
testing cases involves an issue that was addressed in the Maitland 
case. Recall that in Maitland, which is just about the only
example of an academic challenge case in which a student prevailed, 
the court was particularly bothered by the Committee's failure to 
wait for the results of a statistical analysis of the test results 
involved. Statistical analysis of test results, the judge seemed 
to think, is not something that interferes inordinately with the 
discretion of teachers and university officials. In other words, 
mere numbers and calculations do not seem to be things that are 
subject to discretion. The same thing can be said about most of 
the high stakes testing cases. Virtually all of these cases 
involve careful analysis of statistical evaluations of the tests 
involved. In other words, number crunching plays an important role 
in these cases. 



C. Law School Grading: What is it?

All of this, finally, brings this analysis to a very brief
comment about law school grading. Everyone familiar with legal
education knows that most law school teachers base the entire grade
for their courses on students' performance on a single final exam.
Everyone familiar with education generally also knows that teachers
in virtually all other kinds of institutions base the grades for
their courses on a number of different tests, or on a number of
tests and papers, or on a number of tests and papers and quizzes.
These facts raise a series of interesting questions. Are the
classroom grades that teachers in law school give akin to the
grades that teachers in other kinds of educational institutions
give? Alternatively, are the tests given in most law school
classes "high stakes" tests? If the former, then law students who
disagree with the class grades that they receive, and law students
who disagree with expulsion orders based on classroom grades, must
deal with the Horowitz rule. Thus, for all practical purposes,
such students should not even consider filing suit. But if law
school grading involves high stakes testing, then, perhaps, law
students have considerable judicial recourse regarding classroom
grades and expulsion decisions.

IV. Conclusion

A couple of summing up points must now be made. Some critics
of the foregoing might note that few teachers actually do the kinds
of calculations described herein. Few teachers, these critics
might insist, actually calculate Z-scores -- and thus avoid
"weighting" problems. Further, critics of the foregoing might
insist that few teachers actually calculate the "reliability" of
their classroom tests, and thus few teachers actually know whether
they should or should not assign grades in light of those tests.
Finally, critics might argue that few teachers actually calculate
the standard error of measurement in their tests. Since most
teachers do not do these things, these critics will insist, the
failure to do them cannot possibly constitute a substantial
deviation from accepted academic norms.
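
The standard error of measurement mentioned by these hypothetical critics is commonly computed from a test's standard deviation and its reliability coefficient. What follows is a minimal sketch in Python, assuming the conventional textbook formula (standard deviation multiplied by the square root of one minus the reliability). The function names and the sample figures are illustrative assumptions, not values taken from this paper.

```python
from math import sqrt


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Return SD * sqrt(1 - reliability), the conventional SEM formula."""
    if score_sd < 0:
        raise ValueError("standard deviation cannot be negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * sqrt(1.0 - reliability)


def score_band(reported_score: float, score_sd: float, reliability: float):
    """Return the band one SEM below and one SEM above a reported score."""
    sem = standard_error_of_measurement(score_sd, reliability)
    return (reported_score - sem, reported_score + sem)


if __name__ == "__main__":
    # Hypothetical exam: standard deviation of 10 points, reliability of .75.
    print(standard_error_of_measurement(10.0, 0.75))
    print(score_band(82.0, 10.0, 0.75))
```

Under these assumed figures, two students whose reported scores differ by fewer points than the standard error cannot confidently be ranked against each other.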

Two things must quickly be noted about these criticisms.
First, the failure by many teachers to do certain kinds of things
surely is evidence of accepted academic norms. Thus, if law school
grading cases are "academic challenge" cases, then, perhaps,
students will lose. But, as noted repeatedly herein, law school
grading cases may actually be "high stakes" testing cases rather
than academic challenge cases. If this is true, then the burden of
proving compliance with accepted academic norms shifts to the
proponents of the test. Second, even if law school grading
disputes are academic challenge cases -- and thus subject to the
Horowitz/Ewing rule -- students still might have a chance. This
is so, in turn, because educational practices outside of the law
schools may not establish the standard of review for law school
grading.

Consider again the criticism just noted, namely, the assertion
that many teachers do not do the kinds of statistical analysis
described herein. Even if that is true, it matters nothing. A
significant difference exists between the grading practices of law
school teachers and the grading practices of virtually all other
teachers, a difference that plays a powerful role here. Law school
teachers generally assign grades in light of performance on a
single exam. Conversely, virtually all other kinds of teachers assign
grades in light of several tests, or several tests and some papers,
or tests and papers and quizzes and class participation. In other
words, the performance on individual tests has much, much less
impact outside of the law schools than in the law schools. This is
critically important. As teachers rely on more and more factors in
connection with the assignment of grades, the individual impact of
measurement errors such as those described herein tends to become
less and less important. If a teacher gives grades in light of
performance on five exams, for example, it probably matters not at
all that none of the exams has a reliability rating of more than,
say, .50. Bad luck on one test, after all, will tend to balance
out good luck on other tests. Further, as teachers use more and
more items in connection with the calculation of grades,
differences between the performance of different students will
become more and more pronounced. Thus, technical issues regarding
measurement error will play less and less of a role.
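
The claim that five exams of modest reliability can jointly support a sound grade can be made concrete. The sketch below, in Python, uses the Spearman-Brown formula from classical test theory, which this paper does not itself invoke; it is offered only as one conventional way to estimate the reliability of a combined score, under the assumption that the exams are roughly parallel measures of the same thing. The function name is hypothetical.

```python
def composite_reliability(single_test_reliability: float, number_of_tests: int) -> float:
    """Spearman-Brown estimate for a composite of k parallel tests.

    Assumes every test has the same reliability r and measures the same
    ability; the estimate is (k * r) / (1 + (k - 1) * r).
    """
    r = single_test_reliability
    k = number_of_tests
    if k < 1:
        raise ValueError("at least one test is required")
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return (k * r) / (1 + (k - 1) * r)


if __name__ == "__main__":
    # One exam, three exams, and five exams, each with reliability .50.
    for k in (1, 3, 5):
        print(k, round(composite_reliability(0.50, k), 3))
```

Under those assumptions, a grade resting on five exams of .50 reliability each is estimated to be considerably more reliable than a grade resting on any one of them, which is the contrast with single-exam law school grading drawn above.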



The point is this. Because the tests that students take in
law school classes are so commonly used as the sole determinant of
important matters for these students, these tests are not really
like the tests that students take in other kinds of classes.
In fact, law school tests are roughly comparable to the
LSAT, or the GRE, or the Multi-State Bar Exam. These other tests,
like tests in law school classes, are the sole determinants of
important matters. Thus, when grading practices in the law schools
are considered, it matters not so much what teachers in other kinds
of institutions do. Rather, it matters what is done by people who
design and analyze tests like the GRE and the LSAT and the
Multi-State Bar Exam. And, not surprisingly, people who work on these
kinds of tests subject them to enormously rigorous scrutiny.

One last response to the foregoing criticism must yet be made.
As suggested repeatedly herein, calculations of the kind described
herein in fact are difficult to do if teachers do not have access
to a computerized spread sheet. Thus, since many teachers outside
of the law schools simply do not have access to computerized spread
sheets, a ready explanation for the failure to do these things
exists. Law school teachers, however, have no such excuse. At the
present time, many law school teachers have powerful computers on
their own desks, or have easy access to such computers in their
schools' libraries. Further, all law school teachers presently
have access to secretaries who have powerful computers on
their desks.

The bottom line -- to use "lawyer-speak" -- is this. Law
school teachers who make the kinds of measurement errors described
herein make those errors not because they choose to exercise
discretion and do different things. And law school teachers who
make these kinds of errors do not make them because they do not
have access to appropriate computer equipment. Rather, law school
teachers who make these kinds of errors make them because they are
[illegible] or ignorant. And think what kind of defense that would be in
court.

These thoughts, in turn, bring this analysis to one last grade
conference from Hell. This grade conference does not take place in
the [illegible] of a teacher, however. Rather, it takes place in a
court room.



Students' Attorney: Just to remind you quickly, Judge, I
represent two students. One of them missed out on graduating
Number 1 in her class because she got a score half a letter grade
lower than a classmate in one class. That classmate, incidentally,
is now clerking for a judge on the United States Supreme Court. My
other client got an F in a course and flunked out of school. Both
of these students believe that their grades were significantly
affected by "measurement error."

Law School's Attorney: Judge, I'm sure I need not remind you
that all grading decisions made by classroom teachers involve
discretion. Just read the "academic challenge" cases. Read
Horowitz. Read Ewing. Those cases clearly require this action to
be dismissed outright.

Students' Attorney: Not so fast, Your Honor. Two points.
First, the "high stakes" testing cases require testers to prove
that tests are accurate measures. And law school grades, at least
those given when the teacher uses only one test, are a type of or
hybrid form of high stakes testing. Second, even if Horowitz and
Ewing apply to the present situation -- and I think they do not --
these situations do not involve any kind of discretion. We're not
saying here that the teachers should have given the students
additional points on an essay, or that the number of letter grades
given was wrong. Those kinds of things, which were involved in
Horowitz and Ewing, in fact are discretionary and should not be
reviewed by courts. What we're saying is that the teachers here
made essentially mathematical errors.

Law School's Attorney: Judge, we're talking here about obscure
statistical ideas. Nobody pays any attention to this stuff.

Students' Attorney: No, Judge. We're talking here about
human lives.



Grade Conferences From Hell:
Measurement Error in Law School Grading

Notes



1. Gronlund and Linn explicitly state what many law teachers
probably feel. N. Gronlund and R.L. Linn, Measurement and
Evaluation in Teaching (6th ed. 1990) at 470: "No major
educational decision should ever be based on a test score alone."

2. Numerous books on educational measurement exist. Three books,
however, stand out as exceptionally useful to beginners, books that
are simultaneously comprehensive but understandable. R. Ebel and
D. Frisbie, Essentials of Educational Measurement (5th ed. 1991);
W. Mehrens and I. Lehmann, Measurement and Evaluation in Education
and Psychology (4th ed. 1991); G. Sax, Principles of Educational
and Psychological Measurement and Evaluation (3rd ed. 1989). These
are the books upon which the present analysis principally
relies.



3. Readers interested in more sophisticated discussions of
statistics than the ones contained herein might consult: G.V. Glass
and K.D. Hopkins, Statistical Methods in Education and Psychology
(2d ed. 1984), or F.J. Gravetter and L.B. Wallnau, Statistics for
the Behavioral Sciences (2d ed. 1988).

4. See, e.g., [author illegible], "Setting Standards and Cut
Scores: Where Do You Draw the Pass/Fail Line?" 57(4) Bar Examiner
17 (1988); Kurdys, "Grading Essay Answers: The Issue of Reliability
in Essay Scoring," 59(4) Bar Examiner 22 (1990); Lenel, "Issues in
Equating and Combining MBE and Essay Scores," 61(2) Bar Examiner 6
(1992); [author illegible], "Test Validation: What It Is and How It
Should Be Done," 60(3) Bar Examiner 5 (1991); Lenel, "The Essay
Examination Part III: Grading the Essay Examination," 59(3) Bar
Examiner 16 (1990).

5. For discussions of "weighting" issues generally, see Ebel,
supra, at 276 et seq.; Mehrens, supra, at 491 et seq.; Sax, supra,
at 204 et seq. and 539 et seq. See also, P.W. Airasian, Classroom
Assessment (1991) at 339-44; N.E. Gronlund and R.L. Linn,
Measurement and Evaluation in Teaching (6th ed. 1990), at 437-39;
K. Hopkins, J. Stanley, and B. Hopkins, Educational and
Psychological Measurement and Evaluation (1990), at 331 et seq.;
W.J. Popham, Modern Educational Measurement: A Practitioner's Guide
(2d ed. 1990) at 378 et seq.

6. For discussions of "weighting" issues generally, see Ebel,
supra, at 276 et seq.; Mehrens, supra, at 491 et seq.; Sax, supra,
at 204 et seq. and 539 et seq. See also, P.W. Airasian, Classroom
Assessment (1991) at 339-44; N.E. Gronlund and R.L. Linn,
Measurement and Evaluation in Teaching (6th ed. 1990), at 437-39;
K. Hopkins, J. Stanley, and B. Hopkins, Educational and
Psychological Measurement and Evaluation (1990), at 331 et seq.;
W.J. Popham, Modern Educational Measurement: A Practitioner's Guide
(2d ed. 1990) at 378 et seq.

7. This example is drawn directly from Gronlund and Linn, supra,
at 438.



8. Gronlund and Linn describe an additional method that
allows teachers to combine scores without risking the weighting
problems just described. N. Gronlund and R.L. Linn, Measurement and
Evaluation in Teaching at 438-439 (6th ed. 1990). Regrettably,
however, this method only works when two components are to be
combined and given equal weight. Teachers who wish to use this
third technique must do four things. First, they must determine
the range of scores on both of the two components. If, for
example, scores on the first component range from 100 to 80, then
the range on that component is 20 points. If scores on the second
component range from 50 through 10, then the range on that second
component is 40 points. Second, these teachers must divide the
larger range by the smaller to generate a "weighting" factor. In this
case, therefore, 40 divided by 20 is 2. So, the weighting factor
is 2. Third, to equalize the scores, each score on the component
with the smaller range of scores is multiplied by the weighting
factor. Fourth, and finally, the teacher adds the
multiplied score from the one component to the raw score from the
other.
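
The four steps in this note can be sketched in a few lines of Python. This is only an illustration of the range-based technique as the note describes it; the function name and the sample scores are hypothetical, chosen so that the two ranges match the note's example of 20 and 40 points.

```python
def combine_with_equal_weight(first_component, second_component):
    """Combine two score lists so that each carries equal weight.

    Follows the four steps in the note: find both ranges, divide the
    larger range by the smaller, multiply the scores on the
    smaller-range component by that factor, then add the two scores.
    """
    if len(first_component) != len(second_component):
        raise ValueError("each student needs a score on both components")
    range_first = max(first_component) - min(first_component)
    range_second = max(second_component) - min(second_component)
    if range_first == 0 or range_second == 0:
        raise ValueError("a component with no spread cannot be rescaled")
    if range_first <= range_second:
        factor = range_second / range_first
        return [a * factor + b for a, b in zip(first_component, second_component)]
    factor = range_first / range_second
    return [a + b * factor for a, b in zip(first_component, second_component)]


if __name__ == "__main__":
    # Hypothetical scores: first component spans 80-100, second spans 10-50.
    first = [100, 90, 80]
    second = [10, 50, 30]
    print(combine_with_equal_weight(first, second))
```

With these sample scores the weighting factor is 2, so the first component is doubled before the two are added.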

9. This technique is suggested by Gronlund and Linn, supra, at
438-39.



10. It hardly need be said that this notion of "standardizing"
scores is not widely known to lawyers and legal educators.
Nevertheless, this notion has appeared from time to time in
materials associated with the law. Merritt and Reskin used this
notion, for example, in connection with their analysis of law
school employment practices. Merritt and Reskin, "Double Minority:
Empirical Evidence of a Double Standard in Law School Hiring of
Minority Women," 65 S. Cal. L. Rev. 2299 (1992) at notes 56-57.
See also, Garcia and Steele, "Mentally Retarded Offenders," 41
Ark. L. Rev. 809 (1988) at note 24. Perhaps the most thorough
discussion of this concept in the law literature, however, and
surely the most interesting in connection with the present
analysis, involves the Multi-State Bar Examination. Lenel, "Issues
in Equating and Combining MBE and Essay Scores," 61(2) Bar
Examiner 6 (1992).



11. Standardized scores are discussed in virtually all books on
educational measurement. See, e.g., Ebel, supra, at 68.



12. For one of countless discussions of this formula, see Ebel,
supra, at 68. T-scores are a simple derivation of Z-scores.
Instead of assigning the mean a value of zero, as Z-scores do,
T-scores assign the mean the value of 50. Thus, a T-score above 50
is a score that is above the mean, and a T-score below 50 is a
score below the mean. T-scores, which are useful because they do
not contain negative numbers, are calculated by multiplying the
pertinent Z-score by 10 and then adding 50 to the result.
The formula is: T = ((Z-score) * 10) + 50
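
The relationship between raw scores, Z-scores, and T-scores can be sketched in Python. The T-score step follows the formula in this note. The Z-score step assumes the conventional definition (raw score minus the mean, divided by the standard deviation) and assumes the population standard deviation of the class is used; both the function names and the sample scores are illustrative only.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Convert raw scores to Z-scores: (score - mean) / standard deviation."""
    average = mean(raw_scores)
    spread = pstdev(raw_scores)  # assumption: population SD of the class
    if spread == 0:
        raise ValueError("all scores are identical; Z-scores are undefined")
    return [(score - average) / spread for score in raw_scores]


def t_scores(raw_scores):
    """Convert raw scores to T-scores: T = (Z * 10) + 50."""
    return [z * 10 + 50 for z in z_scores(raw_scores)]


if __name__ == "__main__":
    # Hypothetical class of three students.
    exam = [60, 70, 80]
    print(z_scores(exam))
    print(t_scores(exam))
```

The student at the class mean receives a Z-score of zero and a T-score of 50, and no T-score in this sample is negative.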



13. Klein, "Are Your Test Scores Only Half Safe?" 48(1) Bar
Examiner 137 (1979).

14. Id. at 139.

15. Id.



16. Id. at 142. Klein also discusses this issue in other works.
See, e.g., Klein, "On Testing V: How to Answer the Critics," 55(1)
Bar Examiner 16, 22-23 (1986). See also, Lenel, "Issues in
Equating and Combining MBE and Essay Scores," 61(2) Bar Examiner 6
(1992).

17. See Ebel, supra, at p. 76 et seq.; Gronlund, supra, at 77 et
seq. and 101 et seq.; Hopkins, supra, at 113 et seq.; Tuchman,
supra, at p. 146-47; Popham, supra at 121 et seq.; Sax, supra, at
279-81.

18. This formula is discussed, among other places, at Ebel, supra.

19. This formula is discussed, among many other places, at Ebel.



20. See Ebel, supra, at p. 86; N.E. Gronlund and R.L. Linn,
Measurement and Evaluation in Teaching (6th ed. 1990) at 77 et
seq. and 101 et seq.; Tuchman, supra, at p. 146-47. See also, W.J.
Popham, Modern Educational Measurement: A Practitioner's Guide (2d
ed. 1990) at 121 et seq.; Sax, supra, at 279-81.

21. Ebel, supra, at p. 86.

22. Klein, "Are Your Tests Only Half Safe?" 48(1) Bar Examiner 137.

23. Id. at 138.

24. Klein, "On Testing IV: Essay Grading Fictions, Facts
and Forecasts," 54(3) Bar Examiner 23, 24 (1985).

25. Regrettably, the National Conference of Bar Examiners does not
regularly publish data regarding the reliability of the MBE.
Nevertheless, it appears that the MBE has a reliability rating that
would satisfy generally accepted standards. See, id. at 138. See
also, Klein, "On Testing V: How to Respond to Critics," 55(1) Bar
Examiner 16 (1986) (discussion of reliability studies of MBE).






26. For discussions of these techniques, see Lenel, "The Essay
Examination Part III: Grading the Essay Examination," 59(3) Bar
Examiner 16 (1990). See also, Klein, "Essay Grading: Fiction,
Facts and Forecasts," 54(3) Bar Examiner 23 (1985); Kurdys,
"Grading Essay Answers: The Issue of Reliability in Essay
Scoring," 59(4) Bar Examiner 22 (1990).

27. It is surprisingly simple for teachers to determine how much
inconsistency exists in their own grading of essay-type questions.
First, teachers must grade a whole set of essays. Then these
teachers must select a random sample of those essays, perhaps 25%
in a class of 50-60 people, and simply regrade those essays. When
doing this regrading, of course, teachers must not allow themselves
to know what score they gave a paper the first time it was graded.
Then, after the sample has been graded, a simple correlation
analysis is done of the grades given on the two separate occasions.
(Correlation analysis can be done instantly by any major spread
sheet program.) If the correlation between the first set of scores
and the second is relatively high, with 1.00 being perfect, then
chances are the teacher is grading the essays in a fairly
consistent manner. Conversely, if the correlation between the first
set of scores and the second set of scores is low, with 0.00 being
pure chance, then chances are that the teacher is grading the
essays in a fairly inconsistent manner. Consistent grading, of
course, increases test reliability whereas inconsistent grading
decreases test reliability.

For a discussion of this process in connection with the bar exam,
see Lenel, "The Essay Examination Part III: Grading the Essay
Examination," 59(3) Bar Examiner 16, 23-24 (1990).
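
The regrading check described in this note can also be carried out without a spread sheet. The Python sketch below assumes that the "simple correlation analysis" the note has in mind is the ordinary Pearson correlation coefficient; the function names, the 25% default, and the sample grades are hypothetical.

```python
import random
from math import sqrt


def pick_regrade_sample(essay_ids, fraction=0.25, seed=None):
    """Choose a random subset of essays (about 25% by default) to regrade blind."""
    sample_size = max(2, round(len(essay_ids) * fraction))
    return random.Random(seed).sample(list(essay_ids), sample_size)


def pearson_correlation(first_pass, second_pass):
    """Pearson correlation between the first and second grading of the sample."""
    n = len(first_pass)
    if n != len(second_pass) or n < 2:
        raise ValueError("need two equally long lists of at least two grades")
    mean_first = sum(first_pass) / n
    mean_second = sum(second_pass) / n
    covariance = sum((x - mean_first) * (y - mean_second)
                     for x, y in zip(first_pass, second_pass))
    spread_first = sum((x - mean_first) ** 2 for x in first_pass)
    spread_second = sum((y - mean_second) ** 2 for y in second_pass)
    if spread_first == 0 or spread_second == 0:
        raise ValueError("grades with no variation cannot be correlated")
    return covariance / sqrt(spread_first * spread_second)


if __name__ == "__main__":
    # Hypothetical class of 60 essays; regrade roughly a quarter of them.
    to_regrade = pick_regrade_sample(range(1, 61), fraction=0.25, seed=1)
    print(sorted(to_regrade))
    # Hypothetical first and second grades for five regraded essays.
    first = [78, 85, 62, 90, 71]
    second = [80, 83, 65, 88, 70]
    print(round(pearson_correlation(first, second), 2))
```

A result near 1.00 suggests consistent grading; a result near 0.00 suggests the second pass bore little relation to the first.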

28. Ebel, supra, at 80 et seq.; Gronlund, supra, at 87 et seq.;
Mehrens, supra at 251-257; Popham, supra at 136 et seq. and 152 et
seq.; Sax, supra, at 275 et seq.

29. Mehrens, supra at p. 260.

30. Note now an important point. Critics of the foregoing
analysis might note that the impact of luck will be such that any
one student's reported score is likely to be the same distance
above or below that student's true score as any other student's
reported score is likely to be above or below that other student's
true score. Thus, these critics will conclude, reported scores can
safely be used as "stand ins" for true scores. In one important
sense this argument has considerable merit. If teachers use a test
to evaluate the performance of an entire group of students, then
the reported scores on that test can in fact stand in for the true
scores. For the group as a whole, luck will in fact cause the
positive differences that exist between some students' true and
reported scores to cancel out the negative differences that exist
between other students' true and reported scores.






